
P2 Worked Example — Monitoring Framework Entry: Vertrag.AI

Project: Pickles GmbH — AI Governance Framework
Stage: Phase 2 — Worked Example
Document: P2-Worked-Example-Monitoring-Entry-v1.md
Status: Draft
Version: v1
Date: 2026-02-28
Assumptions: Built on the Phase 2 fictional system design — Vertrag.AI does not exist. All metrics, thresholds, and dashboard designs are illustrative. Requires population with real operational data before use.


About This Document

This document populates the Pickles GmbH AI Monitoring Framework (L3-6.1) with a fully worked entry for Vertrag.AI. It defines the specific metrics to be tracked, how they are measured, what thresholds trigger action, the review cadence, and the dashboard design for this system.

The monitoring design reflects Vertrag.AI's internal classification as Medium-High risk, its professional liability exposure, and the particular challenge of monitoring an AI system that processes legally privileged material, which limits what can be logged.


1. System Reference

| Field | Entry |
|---|---|
| System | Vertrag.AI |
| System ID | PKL-SYS-003 |
| Internal risk tier | Medium-High |
| Monitoring tier | Active — quarterly formal review; continuous automated monitoring |
| Monitoring owner | Head of Product |
| Technical monitoring lead | Head of Engineering |
| Entry last reviewed | 2026-02-28 (initial) |

2. Monitoring Constraints

Before metrics are defined, the key constraint shaping Vertrag.AI's monitoring design must be stated clearly.

The GDPR/confidentiality constraint: Contract documents uploaded to Vertrag.AI are legally privileged material and contain personal data. Full contract text, full prompt content, and matter-identifying information cannot be retained in monitoring logs. This means standard LLM monitoring techniques — such as reviewing full prompt/completion pairs for quality — are not available.

This constraint is inherent to the legal market context and cannot be resolved without compromising the confidentiality obligations that make the product viable. The monitoring design works around it through:

  • Metadata-only logging — events and outcomes are logged without content
  • Sampled review — a small, consented sample of sessions used for quality testing
  • Outcome-based signals — lawyer accept/reject patterns as a proxy for output quality
  • User-reported signals — structured error and concern reporting by lawyers

All monitoring design decisions below reflect this constraint.
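
To make the metadata-only constraint concrete, the sketch below shows one possible shape for a review-action log record: events and outcomes are captured, but no contract text, prompt or completion content, or matter identifiers appear anywhere in the record. The field names and the `log_event` helper are illustrative assumptions, not an existing Pickles GmbH schema.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

# Hypothetical metadata-only event record. ASSUMPTION: field names are
# illustrative, not an existing Pickles GmbH schema. Note what is absent:
# no contract text, no prompt or completion content, no matter identifiers.
@dataclass
class ReviewActionEvent:
    event_id: str            # random UUID, never derived from matter data
    session_id: str          # opaque session handle
    firm_id: str             # pseudonymous account identifier
    timestamp: str           # ISO 8601, UTC
    contract_category: str   # coarse category only, e.g. "commercial" (UA-03)
    action: str              # "accept" | "reject" | "modify" (feeds MQ-01/MQ-02)
    model_version: str       # feeds MC-01
    corpus_version: str      # feeds MC-02

def log_event(event: ReviewActionEvent) -> None:
    """Emit the event as a JSON line; a real pipeline would ship this to
    the monitoring store rather than stdout."""
    print(json.dumps(asdict(event)))

log_event(ReviewActionEvent(
    event_id="3d1f0c9a-…",  # truncated for display
    session_id="s-0104",
    firm_id="f-017",
    timestamp=datetime.now(timezone.utc).isoformat(),
    contract_category="commercial",
    action="accept",
    model_version="claude-sonnet-4-6",
    corpus_version="v1.2",
))
```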


3. Metrics Register

3.1 Output Quality Metrics

| Metric ID | Metric Name | Description | Measurement Method | Frequency | Target | Alert Threshold |
|---|---|---|---|---|---|---|
| MQ-01 | Suggestion acceptance rate | Percentage of AI-generated redline suggestions accepted by reviewing lawyers without modification | Derived from review action logs (accept/reject/modify counts per session) | Continuous; weekly aggregate | ≥60% acceptance [ASSUMPTION: baseline to be established in first 90 days] | <40% acceptance in any rolling 7-day window |
| MQ-02 | Suggestion modification rate | Percentage of accepted suggestions that were modified before acceptance | Derived from review action logs | Weekly aggregate | Informational only — no threshold initially | Significant change from baseline (>±15%) |
| MQ-03 | Clause identification miss rate (user-reported) | Number of sessions where lawyer reports a significant clause issue was missed by the system | User-reported via in-product feedback mechanism | Per report; monthly aggregate | 0 per month [ASSUMPTION] | Any single report triggers review; ≥3 in a month triggers incident |
| MQ-04 | Citation accuracy (sampled) | Percentage of specific legal provision citations (BGB articles, AGB standards) that are accurate when spot-checked | Quarterly sampled review by legal content team — minimum 20 sessions reviewed | Quarterly | ≥95% accuracy on cited provisions | <90% triggers urgent review and potential system suspension |
| MQ-05 | False positive rate (sampled) | Percentage of suggestions in sampled sessions assessed as spurious by reviewing lawyers | Quarterly sampled review | Quarterly | ≤15% of suggestions assessed as spurious | >25% triggers product review |

Note on sampling: Quarterly sampled reviews (MQ-04, MQ-05) require law firm users to opt in to having a session reviewed by Pickles GmbH's legal content team. Session content reviewed under sampling must be subject to confidentiality obligations equivalent to those governing the original matter. A consent and confidentiality framework for sampling must be established before these metrics can be collected. [ASSUMPTION: consent framework does not yet exist — flag for Legal Counsel.]
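
As an illustration of how the MQ-01 alert condition in the register above could be evaluated, the following sketch computes a rolling 7-day acceptance rate from hypothetical daily accept/reject/modify aggregates. Only the 40% threshold comes from the register; the data shape and function name are assumptions.

```python
from datetime import date, timedelta

# Hypothetical daily aggregates from the review action logs:
# (day, accepted, rejected, modified). ASSUMPTION: illustrative shape and data.
daily_counts = [(date(2026, 2, d), 42, 18, 9) for d in range(1, 29)]

def acceptance_rate_7d(counts, as_of):
    """MQ-01: rolling 7-day suggestion acceptance rate. 'Accepted' means
    accepted without modification, per the metric definition above."""
    window = [c for c in counts if as_of - timedelta(days=6) <= c[0] <= as_of]
    accepted = sum(c[1] for c in window)
    total = sum(c[1] + c[2] + c[3] for c in window)
    return accepted / total if total else None

rate = acceptance_rate_7d(daily_counts, date(2026, 2, 28))
if rate is not None and rate < 0.40:  # alert threshold from the register
    print(f"ALERT MQ-01: 7-day acceptance rate {rate:.0%} is below 40%")
else:
    print(f"MQ-01 within target: {rate:.0%}")  # -> 61%
```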


3.2 System Performance Metrics

| Metric ID | Metric Name | Description | Measurement Method | Frequency | Target | Alert Threshold |
|---|---|---|---|---|---|---|
| SP-01 | API error rate | Percentage of Anthropic API calls that return an error or timeout | Application error logs | Continuous | <1% error rate | >3% in any 1-hour window |
| SP-02 | Processing latency (p95) | 95th percentile end-to-end processing time from upload to redlined output available | Application performance logs | Continuous | <120 seconds [ASSUMPTION] | >300 seconds sustained for >30 minutes |
| SP-03 | RAG retrieval failure rate | Percentage of queries where the RAG layer returns zero relevant results | RAG system logs | Continuous | <2% of queries | >5% in any 24-hour window |
| SP-04 | Document processing failure rate | Percentage of uploaded documents that fail to parse or process | Application error logs | Continuous | <0.5% | >2% triggers engineering review |
| SP-05 | System availability | Uptime percentage measured against agreed SLA | Infrastructure monitoring | Continuous | ≥99.5% monthly [ASSUMPTION] | Any unplanned downtime >2 hours triggers incident log entry |
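
A minimal sketch of the SP-01 alert check, assuming a hypothetical log shape of per-call timestamps and success flags; only the ">3% in any 1-hour window" rule is taken from the register.

```python
from datetime import datetime, timedelta

# Hypothetical API call log: (timestamp, succeeded). ASSUMPTION: illustrative
# shape, not the actual application log format.
calls = [(datetime(2026, 2, 28, 14, m), m % 40 != 0) for m in range(60)]

def error_rate(entries, window_end, window=timedelta(hours=1)):
    """SP-01: share of Anthropic API calls failing within the window."""
    in_window = [ok for ts, ok in entries
                 if window_end - window <= ts <= window_end]
    if not in_window:
        return None
    return 1 - sum(in_window) / len(in_window)

rate = error_rate(calls, datetime(2026, 2, 28, 15, 0))
if rate is not None and rate > 0.03:  # >3% in any 1-hour window
    print(f"ALERT SP-01: error rate {rate:.1%} exceeds the 3% threshold")
```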

3.3 Usage and Adoption Metrics

These metrics are informational — they support product understanding and help contextualise quality signals rather than triggering direct governance actions.

| Metric ID | Metric Name | Description | Frequency |
|---|---|---|---|
| UA-01 | Active law firm accounts | Number of firms with at least one session in the period | Monthly |
| UA-02 | Sessions per firm | Average sessions per active firm per month | Monthly |
| UA-03 | Contract types processed | Distribution of contract categories processed (commercial, employment, real estate, etc.) | Quarterly |
| UA-04 | Rejection-only sessions | Sessions where the lawyer rejected all AI suggestions — potential indicator of poor output or inappropriate use case | Monthly |
| UA-05 | Pre-signature vs post-signature split | Proportion of sessions in each use case | Monthly |
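
UA-04 is the only usage metric derived from review actions rather than raw counts; a sketch of how rejection-only sessions could be identified from the same metadata-only action log follows (the session identifiers and data shape are illustrative).

```python
from collections import defaultdict

# Hypothetical (session_id, action) pairs from the metadata-only action log.
# ASSUMPTION: illustrative data; identifiers are opaque handles.
actions = [
    ("s-101", "accept"), ("s-101", "reject"),
    ("s-102", "reject"), ("s-102", "reject"),
    ("s-103", "modify"),
]

per_session = defaultdict(list)
for session_id, action in actions:
    per_session[session_id].append(action)

# UA-04: sessions in which the lawyer rejected every suggestion.
rejection_only = [s for s, acts in per_session.items()
                  if all(a == "reject" for a in acts)]
print(f"UA-04 rejection-only sessions this period: {len(rejection_only)}")  # -> 1
```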

3.4 User-Reported Signal Metrics

| Metric ID | Metric Name | Description | Measurement Method | Frequency | Alert Threshold |
|---|---|---|---|---|---|
| UR-01 | User error reports | Structured reports submitted by lawyers via in-product feedback on output quality issues | In-product report form; tracked in issue log | Per report; monthly aggregate | Any report categorised as "material error with client impact" triggers incident review |
| UR-02 | Support tickets related to output quality | Help desk tickets where the reported issue relates to AI output accuracy or relevance | Support ticket tagging | Monthly | ≥3 in a month triggers product review |
| UR-03 | Net Promoter Score (NPS) — governance signal | NPS data filtered for comments mentioning accuracy, reliability, or trust | Quarterly NPS survey | Quarterly | Significant drop (>10 points) or cluster of negative accuracy comments triggers review |
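
The UR-01 trigger lends itself to simple automation. The sketch below routes a hypothetical structured report: the "material error with client impact" category and its incident-review consequence come from the register; the report fields and routing labels are assumptions.

```python
# Hypothetical structured error report. ASSUMPTION: field names and routing
# labels are illustrative; only the "material error with client impact"
# trigger comes from the register above.
report = {
    "report_id": "ur-2026-014",
    "category": "material error with client impact",
    "submitted_by_role": "reviewing lawyer",
}

def triage(r):
    """UR-01: a report in the material-error category opens an incident
    review immediately; everything else joins the monthly aggregate."""
    if r["category"] == "material error with client impact":
        return "incident-review"
    return "monthly-aggregate"

print(triage(report))  # -> incident-review
```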

3.5 Model and Infrastructure Change Signals

| Metric ID | Metric Name | Description | Measurement Method | Frequency |
|---|---|---|---|---|
| MC-01 | Model version in use | Current Claude model version deployed | Deployment log | Continuous — logged on change |
| MC-02 | RAG corpus version | Current version of the German legal knowledge base | Knowledge base version log | Continuous — logged on update |
| MC-03 | Post-change quality delta | Change in MQ-01 (acceptance rate) in the 30 days following any model or corpus change | Derived from MQ-01 data, segmented by deployment date | Within 30 days of any change |
| MC-04 | Prompt version | Current system prompt version in production | Deployment log | Continuous — logged on change |
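
MC-03 is the only derived metric in this group. A sketch of the computation, assuming hypothetical weekly MQ-01 aggregates segmented around a change date:

```python
from datetime import date, timedelta
from statistics import mean

# Hypothetical weekly MQ-01 acceptance rates around a model change.
# ASSUMPTION: all numbers are illustrative.
change_date = date(2026, 2, 1)
weekly_mq01 = {
    date(2026, 1, 5): 0.63, date(2026, 1, 12): 0.64,
    date(2026, 1, 19): 0.62, date(2026, 1, 26): 0.63,
    date(2026, 2, 2): 0.58, date(2026, 2, 9): 0.57,
    date(2026, 2, 16): 0.59, date(2026, 2, 23): 0.58,
}

# MC-03: compare the 30 days after the change with the period before it.
before = [v for d, v in weekly_mq01.items() if d < change_date]
after = [v for d, v in weekly_mq01.items()
         if change_date <= d <= change_date + timedelta(days=30)]
delta = mean(after) - mean(before)
print(f"MC-03 post-change MQ-01 delta: {delta:+.1%}")  # -> -5.0%
```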

4. Review Cadence

| Review Type | Frequency | Trigger | Participants | Output |
|---|---|---|---|---|
| Automated alert review | As triggered | Alert threshold breached (see Section 3) | Head of Engineering; on-call engineer | Incident log entry or resolution note |
| Weekly metrics review | Weekly | Standing | Head of Product | Dashboard review; flag anomalies |
| Monthly monitoring report | Monthly | Calendar | Head of Product; Head of Engineering | Written summary of MQ, SP, and UR metrics; trends; any open issues |
| Quarterly formal review | Quarterly | Calendar | Head of Product; Head of Engineering; Legal Counsel | Full review of all metrics, including the sampled quality reviews (MQ-04, MQ-05); review of assumptions; update of this monitoring entry if thresholds need adjustment |
| Post-incident review | Within 14 days of any incident | Incident closure | Head of Product; Head of Engineering; Legal Counsel (if there is a legal dimension) | Root cause analysis; corrective actions; monitoring update if gaps identified |
| Post-change review | 30 days after a model or corpus change | MC-03 trigger | Head of Product; Head of Engineering | Quality delta assessment; decision to maintain, adjust, or roll back |

5. Dashboard Design

5.1 Dashboard Purpose and Audience

The Vertrag.AI monitoring dashboard provides a single-view summary of system health for the Head of Product and Head of Engineering. It is not a client-facing document. It feeds into the monthly monitoring report and quarterly formal review.

5.2 Dashboard Layout

┌─────────────────────────────────────────────────────────────────┐
│  VERTRAG.AI — MONITORING DASHBOARD          [Last updated: live] │
├─────────────────────────────────────────────────────────────────┤
│  SYSTEM STATUS          │  MODEL VERSION      │  CORPUS VERSION  │
│  ● OPERATIONAL          │  claude-sonnet-4-6  │  v1.2 (Feb 2026) │
├───────────────────────────────────────┬─────────────────────────┤
│  OUTPUT QUALITY                       │  SYSTEM PERFORMANCE      │
│                                       │                          │
│  Suggestion acceptance rate (7d)      │  API error rate (24h)    │
│  ████████████░░  64%  ✓               │  0.4%  ✓                 │
│                                       │                          │
│  User error reports (30d)             │  Processing latency p95  │
│  2 reports — 0 material  ✓            │  87s  ✓                  │
│                                       │                          │
│  Citation accuracy (last sample)      │  RAG retrieval failures  │
│  97%  ✓  [Q4 2025]                    │  1.1%  ✓                 │
│                                       │                          │
│  False positive rate (last sample)    │  System availability     │
│  11%  ✓  [Q4 2025]                    │  99.8%  ✓                │
├───────────────────────────────────────┴─────────────────────────┤
│  TREND — SUGGESTION ACCEPTANCE RATE (12 weeks)                   │
│                                                                  │
│  70% ┤                                                           │
│  65% ┤    ●   ●       ●   ●   ●                                  │
│  60% ┤●       ●   ●       ●       ●   ●                          │
│  55% ┤                                ●                          │
│  50% ┤                                                           │
│      └──────────────────────────────────────────────────────    │
│        W1  W2  W3  W4  W5  W6  W7  W8  W9  W10 W11 W12          │
│                                                 ↑ alert zone     │
├─────────────────────────────────────────────────────────────────┤
│  OPEN ALERTS          │  NEXT REVIEW                             │
│  None                 │  Monthly: 31 Mar 2026                    │
│                       │  Quarterly: 30 Apr 2026                  │
│                       │  Sampled review due: 30 Apr 2026         │
└─────────────────────────────────────────────────────────────────┘

[ASSUMPTION] Dashboard above is illustrative. Technical implementation (tooling, data connections, refresh frequency) to be confirmed by engineering team. Candidate tooling includes internal BI dashboard, Grafana, or a purpose-built product health screen within the Pickles GmbH operations interface.

5.3 Alert States

| Status Indicator | Meaning |
|---|---|
| Green — within target | No action required |
| Amber — approaching threshold | Monitor closely; review at next scheduled meeting |
| Red — threshold breached | Immediate review required; escalate per incident response playbook (L3-6.2) |
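
A sketch of how a dashboard tile could map a metric reading to these three states. Note that the amber band is an assumption: the table defines green and red precisely, and amber is interpreted here as past the target but not yet past the alert threshold.

```python
def alert_state(value, target, threshold, higher_is_better=True):
    """Map a metric reading to a dashboard state. ASSUMPTION: amber is
    interpreted as past the target but not yet past the alert threshold;
    the table above defines only green and red precisely."""
    if not higher_is_better:
        value, target, threshold = -value, -target, -threshold
    if value >= target:
        return "green"
    if value > threshold:
        return "amber"
    return "red"

# MQ-01: target >=60% acceptance, alert below 40%
print(alert_state(0.64, 0.60, 0.40))  # green
print(alert_state(0.52, 0.60, 0.40))  # amber
print(alert_state(0.38, 0.60, 0.40))  # red

# SP-01: target <1% error rate, alert above 3% (lower is better)
print(alert_state(0.004, 0.01, 0.03, higher_is_better=False))  # green
```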

6. Monitoring Gaps and Open Actions

| # | Gap | Priority | Owner | Action |
|---|---|---|---|---|
| MON-001 | Consent and confidentiality framework for sampled quality review (MQ-04, MQ-05) not yet established | High | Legal Counsel | Design consent model for sampling; obtain legal sign-off before quarterly sampled review begins |
| MON-002 | Baseline acceptance rate (MQ-01 target) not yet established — no production data | High | Head of Product | Run 90-day baseline period before alert threshold is enforced; set threshold from baseline data |
| MON-003 | In-product user error report mechanism not yet confirmed | Medium | Head of Product / Engineering | Confirm UR-01 and UR-02 reporting mechanisms are built into product |
| MON-004 | Dashboard tooling not yet selected or implemented | Medium | Head of Engineering | Select tooling; implement initial dashboard before system goes live |
| MON-005 | RAG corpus version control logging not yet confirmed | Medium | Head of Engineering | Confirm MC-02 logging is implemented; version history maintained |

7. Connection to Other Governance Documents

| Area | Document |
|---|---|
| Incident response (when thresholds are breached) | L3-6.2-Incident-Response-Playbook-v1.md |
| Model change management (MC-01 to MC-04) | L3-6.3-Model-Change-Management-Protocol-v1.md |
| Testing methodology (MQ-04, MQ-05 sampled reviews) | P2-Worked-Example-Technical-Documentation-v1.md, Section 4 |
| Human oversight design (which MQ-01 to MQ-03 depend on) | L1-3.4-Human-Oversight-Policy-v1.md |
| GDPR logging constraints (Section 2) | L2-5.1-Data-Flow-Map-v1.md; L2-5.2-DPIA-Assessment-v1.md |

This document is a fictional worked example produced for educational and demonstration purposes. Vertrag.AI does not exist. All metrics, thresholds, and dashboard designs are illustrative and do not constitute operational monitoring specifications. Professional review is required before any monitoring framework is applied operationally.