P2 Worked Example — Monitoring Framework Entry: Vertrag.AI

Project: Pickles GmbH — AI Governance Framework Stage: Phase 2 — Worked Example Document: P2-Worked-Example-Monitoring-Entry-v1.md Status: Draft Version: v1 Date: 2026-02-28 Assumptions: Built on Phase 2 fictional system design — Vertrag.AI does not exist. All metrics, thresholds, and dashboard designs are illustrative. Requires population with real operational data before use.

About This Document

This document populates the Pickles GmbH AI Monitoring Framework (L3-6.1) with a fully worked entry for Vertrag.AI. It defines the specific metrics to be tracked, how they are measured, what thresholds trigger action, the review cadence, and the dashboard design for this system.

The monitoring design reflects Vertrag.AI's internal classification as Medium-High risk, professional liability exposure, and the particular challenge of monitoring an AI system that processes legally privileged material — which limits what can be logged.

1. System Reference

Field	Entry
System	Vertrag.AI
System ID	PKL-SYS-003
Internal risk tier	Medium-High
Monitoring tier	Active — quarterly formal review; continuous automated monitoring
Monitoring owner	Head of Product
Technical monitoring lead	Head of Engineering
Entry last reviewed	2026-02-28 (initial)

2. Monitoring Constraints

Before defining metrics, the key constraint shaping Vertrag.AI's monitoring design must be stated clearly.

The GDPR/confidentiality constraint: Contract documents uploaded to Vertrag.AI are legally privileged material and contain personal data. Full contract text, full prompt content, and matter-identifying information cannot be retained in monitoring logs. This means standard LLM monitoring techniques — such as reviewing full prompt/completion pairs for quality — are not available.

This constraint is inherent to the legal market context and cannot be resolved without compromising the confidentiality obligations that make the product viable. The monitoring design works around it through:

Metadata-only logging — events and outcomes are logged without content
Sampled review — a small, consented sample of sessions used for quality testing
Outcome-based signals — lawyer accept/reject patterns as a proxy for output quality
User-reported signals — structured error and concern reporting by lawyers

All monitoring design decisions below reflect this constraint.

3. Metrics Register

3.1 Output Quality Metrics

Metric ID	Metric Name	Description	Measurement Method	Frequency	Target	Alert Threshold
MQ-01	Suggestion acceptance rate	Percentage of AI-generated redline suggestions accepted by reviewing lawyers without modification	Derived from review action logs (accept/reject/modify counts per session)	Continuous; weekly aggregate	≥60% acceptance [ASSUMPTION: baseline to be established in first 90 days]	<40% acceptance in any rolling 7-day window
MQ-02	Suggestion modification rate	Percentage of accepted suggestions that were modified before acceptance	Derived from review action logs	Weekly aggregate	Informational only — no threshold initially	Significant change from baseline (>±15%)
MQ-03	Clause identification miss rate (user-reported)	Number of sessions where lawyer reports a significant clause issue was missed by the system	User-reported via in-product feedback mechanism	Per report; monthly aggregate	0 per month [ASSUMPTION]	Any single report triggers review; ≥3 in a month triggers incident
MQ-04	Citation accuracy (sampled)	Percentage of specific legal provision citations (BGB articles, AGB standards) that are accurate when spot-checked	Quarterly sampled review by legal content team — minimum 20 sessions reviewed	Quarterly	≥95% accuracy on cited provisions	<90% triggers urgent review and potential system suspension
MQ-05	False positive rate (sampled)	Percentage of suggestions in sampled sessions assessed as spurious by reviewing lawyers	Quarterly sampled review	Quarterly	≤15% of suggestions assessed as spurious	>25% triggers product review

Note on sampling: Quarterly sampled reviews (MQ-04, MQ-05) require law firm users to opt in to having a session reviewed by Pickles GmbH's legal content team. Session content reviewed under sampling must be subject to confidentiality obligations equivalent to those governing the original matter. Consent and confidentiality framework for sampling must be established before this metric can be collected. [ASSUMPTION: consent framework does not yet exist — flag for Legal Counsel.]

3.2 System Performance Metrics

Metric ID	Metric Name	Description	Measurement Method	Frequency	Target	Alert Threshold
SP-01	API error rate	Percentage of Anthropic API calls that return an error or timeout	Application error logs	Continuous	<1% error rate	>3% in any 1-hour window
SP-02	Processing latency (p95)	95th percentile end-to-end processing time from upload to redlined output available	Application performance logs	Continuous	<120 seconds [ASSUMPTION]	>300 seconds sustained for >30 minutes
SP-03	RAG retrieval failure rate	Percentage of queries where RAG layer returns zero relevant results	RAG system logs	Continuous	<2% of queries	>5% in any 24-hour window
SP-04	Document processing failure rate	Percentage of uploaded documents that fail to parse or process	Application error logs	Continuous	<0.5%	>2% triggers engineering review
SP-05	System availability	Uptime percentage measured against agreed SLA	Infrastructure monitoring	Continuous	≥99.5% monthly [ASSUMPTION]	Any unplanned downtime >2 hours triggers incident log entry

3.3 Usage and Adoption Metrics

These metrics are informational — they support product understanding and help contextualise quality signals rather than triggering direct governance actions.

Metric ID	Metric Name	Description	Frequency
UA-01	Active law firm accounts	Number of firms with at least one session in the period	Monthly
UA-02	Sessions per firm	Average sessions per active firm per month	Monthly
UA-03	Contract types processed	Distribution of contract categories processed (commercial, employment, real estate, etc.)	Quarterly
UA-04	Rejection-only sessions	Sessions where the lawyer rejected all AI suggestions — potential indicator of poor output or inappropriate use case	Monthly
UA-05	Pre-signature vs post-signature split	Proportion of sessions in each use case	Monthly

3.4 User-Reported Signal Metrics

Metric ID	Metric Name	Description	Measurement Method	Frequency	Alert Threshold
UR-01	User error reports	Structured reports submitted by lawyers via in-product feedback on output quality issues	In-product report form; tracked in issue log	Per report; monthly aggregate	Any report categorised as "material error with client impact" triggers incident review
UR-02	Support tickets related to output quality	Help desk tickets where the reported issue relates to AI output accuracy or relevance	Support ticket tagging	Monthly	≥3 in a month triggers product review
UR-03	Net Promoter Score (NPS) — governance signal	NPS data filtered for comments mentioning accuracy, reliability, or trust	Quarterly NPS survey	Quarterly	Significant drop (>10 points) or cluster of negative accuracy comments triggers review

3.5 Model and Infrastructure Change Signals

Metric ID	Metric Name	Description	Measurement Method	Frequency
MC-01	Model version in use	Current Claude model version deployed	Deployment log	Continuous — logged on change
MC-02	RAG corpus version	Current version of German legal knowledge base	Knowledge base version log	Continuous — logged on update
MC-03	Post-change quality delta	Change in MQ-01 (acceptance rate) in 30 days following any model or corpus change	Derived from MQ-01 data, segmented by deployment date	Within 30 days of any change
MC-04	Prompt version	Current system prompt version in production	Deployment log	Continuous — logged on change

4. Review Cadence

Review Type	Frequency	Trigger	Participants	Output
Automated alert review	As triggered	Alert threshold breached (see Section 3)	Head of Engineering; on-call engineer	Incident log entry or resolution note
Weekly metrics review	Weekly	Standing	Head of Product	Dashboard review; flag anomalies
Monthly monitoring report	Monthly	Calendar	Head of Product; Head of Engineering	Written summary of MQ, SP, UR metrics; trends; any open issues
Quarterly formal review	Quarterly	Calendar	Head of Product; Head of Engineering; Legal Counsel	Full review of all metrics including sampled quality review (MQ-04, MQ-05); review of assumptions; update monitoring entry if thresholds need adjustment
Post-incident review	Within 14 days of any incident	Incident closure	Head of Product; Head of Engineering; Legal Counsel (if legal dimension)	Root cause analysis; corrective actions; monitoring update if gaps identified
Post-change review	30 days after model or corpus change	MC-03 trigger	Head of Product; Head of Engineering	Quality delta assessment; decision to maintain, adjust, or rollback

5. Dashboard Design

5.1 Dashboard Purpose and Audience

The Vertrag.AI monitoring dashboard provides a single-view summary of system health for the Head of Product and Head of Engineering. It is not a client-facing document. It feeds into the monthly monitoring report and quarterly formal review.

5.2 Dashboard Layout

┌─────────────────────────────────────────────────────────────────┐
│  VERTRAG.AI — MONITORING DASHBOARD          [Last updated: live] │
├─────────────────────────────────────────────────────────────────┤
│  SYSTEM STATUS          │  MODEL VERSION      │  CORPUS VERSION  │
│  ● OPERATIONAL          │  claude-sonnet-4-6  │  v1.2 (Feb 2026) │
├───────────────────────────────────────┬─────────────────────────┤
│  OUTPUT QUALITY                       │  SYSTEM PERFORMANCE      │
│                                       │                          │
│  Suggestion acceptance rate (7d)      │  API error rate (24h)    │
│  ████████████░░  64%  ✓               │  0.4%  ✓                 │
│                                       │                          │
│  User error reports (30d)             │  Processing latency p95  │
│  2 reports — 0 material  ✓            │  87s  ✓                  │
│                                       │                          │
│  Citation accuracy (last sample)      │  RAG retrieval failures  │
│  97%  ✓  [Q4 2025]                    │  1.1%  ✓                 │
│                                       │                          │
│  False positive rate (last sample)    │  System availability     │
│  11%  ✓  [Q4 2025]                    │  99.8%  ✓                │
├───────────────────────────────────────┴─────────────────────────┤
│  TREND — SUGGESTION ACCEPTANCE RATE (12 weeks)                   │
│                                                                  │
│  70% ┤                                                           │
│  65% ┤    ●   ●       ●   ●   ●                                  │
│  60% ┤●       ●   ●       ●       ●   ●                          │
│  55% ┤                                ●                          │
│  50% ┤                                                           │
│      └──────────────────────────────────────────────────────    │
│        W1  W2  W3  W4  W5  W6  W7  W8  W9  W10 W11 W12          │
│                                                 ↑ alert zone     │
├─────────────────────────────────────────────────────────────────┤
│  OPEN ALERTS          │  NEXT REVIEW                             │
│  None                 │  Monthly: 31 Mar 2026                    │
│                       │  Quarterly: 30 Apr 2026                  │
│                       │  Sampled review due: 30 Apr 2026         │
└─────────────────────────────────────────────────────────────────┘

[ASSUMPTION] Dashboard above is illustrative. Technical implementation (tooling, data connections, refresh frequency) to be confirmed by engineering team. Candidate tooling includes internal BI dashboard, Grafana, or a purpose-built product health screen within the Pickles GmbH operations interface.

5.3 Alert States

Status	Indicator	Meaning
✓	Green — within target	No action required
⚠	Amber — approaching threshold	Monitor closely; review at next scheduled meeting
✗	Red — threshold breached	Immediate review required; escalate per incident response playbook (L3-6.2)

6. Monitoring Gaps and Open Actions

#	Gap	Priority	Owner	Action
MON-001	Consent and confidentiality framework for sampled quality review (MQ-04, MQ-05) not yet established	High	Legal Counsel	Design consent model for sampling; obtain legal sign-off before quarterly sampled review begins
MON-002	Baseline acceptance rate (MQ-01 target) not yet established — no production data	High	Head of Product	Run 90-day baseline period before alert threshold is enforced; set threshold from baseline data
MON-003	In-product user error report mechanism not yet confirmed	Medium	Head of Product / Engineering	Confirm UR-01 and UR-02 reporting mechanisms are built into product
MON-004	Dashboard tooling not yet selected or implemented	Medium	Head of Engineering	Select tooling; implement initial dashboard before system goes live
MON-005	RAG corpus version control logging not yet confirmed	Medium	Head of Engineering	Confirm MC-02 logging is implemented; version history maintained

7. Connection to Other Governance Documents

Area	Document
Incident response (when thresholds are breached)	L3-6.2-Incident-Response-Playbook-v1.md
Model change management (MC-01 to MC-04)	L3-6.3-Model-Change-Management-Protocol-v1.md
Testing methodology (MQ-04, MQ-05 sampled reviews)	P2-Worked-Example-Technical-Documentation-v1.md Section 4
Human oversight design (MQ-01 to MQ-03 depend on)	L1-3.4-Human-Oversight-Policy-v1.md
GDPR logging constraints (Section 2)	L2-5.1-Data-Flow-Map-v1.md; L2-5.2-DPIA-Assessment-v1.md

This document is a fictional worked example produced for educational and demonstration purposes. Vertrag.AI does not exist. All metrics, thresholds, and dashboard designs are illustrative and do not constitute operational monitoring specifications. Professional review is required before any monitoring framework is applied operationally.