Monitoring Framework
Project: Sable AI Ltd — AI Governance Framework
Stage: Stage 4 — Monitoring & Operational Controls
Status: Draft
Version: v1
Date: 2026-03-01
Assumptions: Built on outline assumptions — not verified against real Sable AI Ltd data
1. Purpose and Scope
This document establishes the operational monitoring framework for Scout, Sable AI Ltd's AI-powered CV screening and candidate shortlisting tool [ASSUMPTION A-002]. It defines the key performance indicators, measurement methods, alert thresholds, and governance processes required to ensure Scout continues to operate accurately, fairly, and in compliance with UK GDPR, Equality Act 2010, and ICO expectations after deployment.
Monitoring is a legal obligation, not an optional operational practice. UK GDPR Article 5(1)(a)–(e) requires that personal data processing remains lawful, fair, accurate, and limited throughout the processing lifecycle. The ICO's AI in Recruitment Outcomes Report (November 2024) confirms that the ICO actively assesses whether AI providers have "regularly monitored accuracy and bias and swiftly addressed issues throughout the AI lifecycle." The DSIT Responsible AI in Recruitment Guide (March 2024) further confirms that continuous monitoring is required to detect model drift, errors, and bias emerging during live operation.
Scope: This framework covers all processing of candidate personal data by Scout, including CV ingestion, Anthropic Claude API processing [ASSUMPTION A-005], structured output generation, and human reviewer interface interactions [ASSUMPTION A-007]. It applies to Sable AI Ltd as data processor and to Sable AI Ltd's customers who are controllers.
2. Regulatory Basis
| Obligation | Source | Relevance to Scout |
|---|---|---|
| Ongoing accuracy and fairness | UK GDPR Art. 5(1)(a)–(d); Art. 25 | Accuracy of shortlisting outputs; fairness to candidates across processing lifecycle |
| Data minimisation in monitoring | UK GDPR Art. 5(1)(c) | Monitoring methods must not collect excessive personal data |
| Regular testing of controls | UK GDPR Art. 32(1)(d) | Regular testing and evaluation of technical and organisational security measures |
| Breach documentation | UK GDPR Art. 33(5) | All breaches recorded regardless of reportability to ICO |
| Accountability | UK GDPR Art. 5(2); Art. 24 | Demonstrate compliance with all monitoring obligations through documented evidence |
| Human review of AI outputs | UK GDPR Arts. 22A–22C; ICO AI in Recruitment Outcomes Report (Nov 2024) | Human review before candidate contact; genuine override capability |
| Bias and fairness monitoring | ICO AI in Recruitment Outcomes Report (Nov 2024); DSIT Responsible AI in Recruitment Guide (March 2024) | Four-fifths rule; periodic testing; KPI reporting to senior management |
| Model drift detection | DSIT Responsible AI in Recruitment Guide (March 2024) | Post-deployment performance monitoring to prevent accuracy decay |
| Evidence retention | ICO AI in Recruitment Outcomes Report (Nov 2024); UK GDPR Art. 5(2) | Retain test results and records of remediation actions taken |
3. Monitoring Principles
3.1 Proactive, not reactive
Monitoring must detect issues before they cause candidate harm. Waiting for complaints to surface bias or accuracy problems is insufficient. The DSIT Responsible AI in Recruitment Guide states that "a failure to proactively monitor for these risks can result in the emergence of harms and a reduction in system efficacy." The ICO expects evidence of proactive monitoring, not merely reactive incident records.
3.2 Evidence-based governance
All monitoring results must be documented and retained. The ICO expects AI providers to "report key performance indicators for accuracy and bias regularly to senior managers and key stakeholders" and to retain "test results or reports and evidence of actions taken to address issues" (ICO AI in Recruitment Outcomes Report, November 2024).
3.3 Human feedback as a monitoring input
Automated metric monitoring must be supplemented by structured human feedback channels — from recruiters and from candidates — because automated performance testing cannot identify every possible harm (DSIT Responsible AI in Recruitment Guide, Contestability and Redress section). Complaints, challenge requests, and recruiter feedback are monitoring inputs, not merely customer service events.
3.4 Data minimisation
Monitoring methods must not collect personal data beyond what is strictly necessary for the monitoring purpose. This constraint applies with particular force to demographic data used for bias monitoring. See L3-4.2-Bias-Monitoring-Protocol-v1.md for the full legal framework governing demographic data collection and use.
3.5 Proportionality
Monitoring intensity must be appropriate for an early-stage company [ASSUMPTION A-001]. This framework is designed to be implementable by a 10–15 person team without a dedicated compliance function. Metric collection should be largely automated; manual review time should be targeted at high-risk outputs and threshold breaches.
4. Metrics Register
All metrics below are grouped by category. Each metric includes its definition, measurement method, frequency, alert threshold, and responsible owner. Owners are defined by role per L1-2.5-Roles-and-Responsibilities-v1.md.
4.1 Shortlisting Performance Metrics
| # | Metric | Definition | Measurement Method | Frequency | Alert Threshold | Owner |
|---|---|---|---|---|---|---|
| M-01 | Shortlisting accuracy rate | % of Scout shortlist recommendations that a trained human reviewer confirms as appropriate to advance, based on structured sampling | Random sample of completed shortlisting runs — minimum 5% of outputs per month or 50 outputs (whichever is higher) [ASSUMPTION] | Monthly | <80% accuracy on sampled outputs triggers investigation | CTO |
| M-02 | Human review compliance rate | % of Scout shortlisting outputs formally reviewed by a human recruiter before any candidate contact is made | Automated flag raised where candidate contact is recorded in the Scout audit log before a review completion event is logged [ASSUMPTION A-020] | Continuous (automated); weekly summary report | <100% is a compliance breach — any instance triggers immediate investigation | Customer Success Lead |
| M-03 | Human override rate | % of Scout outputs where a human reviewer changes the recommendation (advances a candidate Scout rated below threshold, or rejects one rated above) | Review audit log — count of override decisions as % of total reviews completed | Monthly | >25% override rate may indicate model accuracy degradation requiring investigation [ASSUMPTION] | CTO |
| M-04 | Confidence score distribution | Distribution of Scout's internal relevance or confidence scores across all shortlisted and rejected candidates per shortlisting run | API output log analysis — distribution statistics per run | Per-run; monthly aggregate trend | Unexpected spike in low-confidence outputs, or marked compression of score range, triggers investigation | Engineering Lead |
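[ASSUMPTION] As an illustrative sketch only: assuming the Scout audit log can be exported as per-candidate timestamped events [ASSUMPTION A-020], the M-02 automated flag could be implemented along the following lines. The event names are hypothetical placeholders, not Scout's actual schema; any candidate ID returned is a compliance breach requiring immediate investigation.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class AuditEvent:
    candidate_id: str
    event_type: str  # hypothetical names: "review_completed", "candidate_contacted"
    timestamp: datetime


def review_compliance_breaches(events: list[AuditEvent]) -> list[str]:
    """M-02: return candidate IDs contacted before any review_completed event."""
    first_review: dict[str, datetime] = {}
    first_contact: dict[str, datetime] = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        if event.event_type == "review_completed":
            first_review.setdefault(event.candidate_id, event.timestamp)
        elif event.event_type == "candidate_contacted":
            first_contact.setdefault(event.candidate_id, event.timestamp)
    # Breach: contact with no completed review, or contact before the review.
    return [
        cid
        for cid, contacted in first_contact.items()
        if cid not in first_review or contacted < first_review[cid]
    ]
```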
4.2 Human Review Quality Metrics
| # | Metric | Definition | Measurement Method | Frequency | Alert Threshold | Owner |
|---|---|---|---|---|---|---|
| M-05 | Review completion time | Average time from Scout output delivery to human review completion | Timestamped review events in Scout audit log [ASSUMPTION A-020] | Weekly | >5 business days average may indicate reviewers bypassing the review step [ASSUMPTION] | Customer Success Lead |
| M-06 | Reviewer challenge rate | % of candidate shortlisting decisions formally flagged by a reviewer as potentially biased, inaccurate, or requiring re-assessment | Challenge log entries as % of total reviews | Monthly | Any non-zero rate triggers bias investigation protocol per L3-4.2-Bias-Monitoring-Protocol-v1.md | CTO |
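[ASSUMPTION] A minimal sketch of the M-05 measurement, assuming delivery and review timestamps can be exported from the audit log as ISO dates. `numpy.busday_count` treats Saturday and Sunday as non-business days by default; the dates below are illustrative.

```python
import numpy as np


def mean_review_business_days(delivered: list[str], reviewed: list[str]) -> float:
    """M-05: average business days from Scout output delivery to review completion.

    `delivered` and `reviewed` are parallel lists of ISO dates ("YYYY-MM-DD").
    """
    gaps = [np.busday_count(d, r) for d, r in zip(delivered, reviewed)]
    return float(np.mean(gaps))


# Example: an average above 5 business days breaches the M-05 alert threshold.
average = mean_review_business_days(
    ["2026-03-02", "2026-03-09"], ["2026-03-13", "2026-03-11"]
)
if average > 5:
    print(f"ALERT M-05: average review time of {average:.1f} business days")
```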
4.3 Candidate-Facing Metrics
| # | Metric | Definition | Measurement Method | Frequency | Alert Threshold | Owner |
|---|---|---|---|---|---|---|
| M-07 | Candidate complaint rate | Number of candidates who raise a formal complaint about Scout's assessment of their application, including requests for human review on fairness or accuracy grounds | Complaints register; intake via customer escalation path | Monthly | >1% of candidates processed in any month triggers investigation and DPIA review consideration [ASSUMPTION] | CTO (acting DPO) |
| M-08 | Arts. 22A–22C human review / challenge request rate | Number of candidates invoking their rights under UK GDPR Arts. 22A–22C to request human review of a solely automated decision, or exercising the right to contest under Art. 22C | Rights request log maintained by CTO (acting DPO) | Monthly | Sustained request volume over 3 consecutive months triggers DPIA review and transparency notice update | CTO (acting DPO) |
| M-09 | Transparency notice delivery rate | % of candidate data processing events where a compliant transparency notice has been delivered by the recruiter customer prior to Scout processing [ASSUMPTION A-019] | Customer attestation; periodic contractual audit | Quarterly | <100% is a contractual breach requiring customer remediation action | Customer Success Lead |
4.4 Model Performance and Drift Metrics
| # | Metric | Definition | Measurement Method | Frequency | Alert Threshold | Owner |
|---|---|---|---|---|---|---|
| M-10 | Shortlisting rate stability | % of candidates shortlisted per run, tracked over time — stable rates suggest model consistency; unexplained shifts indicate potential drift | Aggregate run-level statistics from Scout audit log; rolling 3-month baseline | Per-run; monthly trend analysis | >15 percentage point shift from rolling 3-month baseline triggers investigation [ASSUMPTION] | Engineering Lead |
| M-11 | Canary accuracy test | Whether a set of predefined test CVs with known expected outputs continue to produce consistent Scout recommendations following any model, prompt, or configuration change | Canary test suite — minimum 10 test CVs covering range of seniority, background, and CV format [ASSUMPTION] | Before any model or prompt change is deployed; monthly otherwise | Any canary test failure blocks deployment of the relevant change until investigation complete | Engineering Lead |
| M-12 | Output format compliance | % of Scout outputs conforming to the expected structured format: required fields present; no hallucinated content fields; no narrative content outside specification | Automated output schema validation at API response layer | Per-run | >1% format non-compliance in any single run | Engineering Lead |
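[ASSUMPTION] An illustrative sketch of the M-10 drift check, assuming monthly shortlisting rates (as percentages) are available from the audit log aggregates. The rolling 3-month baseline and 15 percentage point threshold follow the table above; the numbers in the example are invented.

```python
import statistics


def shortlisting_drift_alert(monthly_rates: list[float],
                             threshold_pp: float = 15.0) -> bool:
    """M-10: flag when the latest monthly shortlisting rate (in %) shifts more
    than `threshold_pp` percentage points from the rolling 3-month baseline."""
    if len(monthly_rates) < 4:
        return False  # insufficient history to form a rolling baseline
    baseline = statistics.mean(monthly_rates[-4:-1])
    return abs(monthly_rates[-1] - baseline) > threshold_pp


# Example: baseline mean of 22, 19 and 21 is ~20.7%; a jump to 40% is a >15pp
# shift from the rolling baseline, so the check triggers an investigation.
assert shortlisting_drift_alert([22.0, 19.0, 21.0, 40.0]) is True
```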
4.5 Bias Proxy Metrics
The following metrics address the obligation to monitor for discriminatory impact in Scout's shortlisting outputs. They involve a direct tension between the duty to monitor for bias and UK GDPR constraints on processing demographic (special category) personal data. The legal framework governing the permissible approach to each metric is set out in detail in L3-4.2-Bias-Monitoring-Protocol-v1.md. [LEGAL REVIEW REQUIRED] Legal review must be completed before implementing any approach to demographic data collection for monitoring purposes.
| # | Metric | Definition | Measurement Method | Frequency | Alert Threshold | Owner |
|---|---|---|---|---|---|---|
| M-13 | Adverse impact ratio — voluntary demographic data | Where candidate demographic data has been collected on a legally permissible basis (see L3-4.2-Bias-Monitoring-Protocol-v1.md §3–4): shortlisting rate ratio between the lowest-performing protected-characteristic group and the highest-performing group | Statistical analysis of shortlisting outcomes segmented by demographic group — applies four-fifths rule (see L3-4.2-Bias-Monitoring-Protocol-v1.md §5) | Monthly where data available; otherwise quarterly aggregate | Adverse impact ratio <0.80 for any monitored group triggers bias investigation per L3-4.2-Bias-Monitoring-Protocol-v1.md | CTO |
| M-14 | Employment gap shortlisting rate | Shortlisting rate for candidates whose CVs contain employment gaps of >12 months, compared to candidates without such gaps — employment gaps are a recognised proxy for pregnancy/maternity, disability, and caring responsibilities [ASSUMPTION] | Manual review of sampled rejection outputs where employment gap identified in CV | Quarterly | >10% relative rejection-rate difference between gap and no-gap CVs triggers a review of Scout's prompt design and a bias investigation | CTO |
| M-15 | Aggregate shortlisting rate consistency | Comparison of overall shortlisting rates across different customer job roles and sectors over time, to detect unexplained rate compression or expansion that might indicate a systemic bias signal | Aggregate run statistics by customer segment from Scout audit log | Monthly | >20 percentage point difference in shortlisting rate across comparable job categories [ASSUMPTION] | Engineering Lead |
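[ASSUMPTION] A minimal sketch of the M-13 four-fifths rule calculation. The group labels and rates are purely illustrative; the full methodology, including minimum group sizes and the legal basis for holding the underlying demographic data, is governed by L3-4.2-Bias-Monitoring-Protocol-v1.md.

```python
def adverse_impact_ratio(shortlist_rates: dict[str, float]) -> float:
    """M-13 four-fifths rule: lowest group shortlisting rate divided by the
    highest. A ratio below 0.80 triggers the bias investigation protocol."""
    rates = shortlist_rates.values()
    return min(rates) / max(rates)


# Example with purely illustrative rates: 0.22 / 0.30 ≈ 0.73, below 0.80.
if adverse_impact_ratio({"group_a": 0.30, "group_b": 0.22}) < 0.80:
    print("ALERT M-13: adverse impact ratio below four-fifths threshold")
```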
4.6 Data Governance Metrics
| # | Metric | Definition | Measurement Method | Frequency | Alert Threshold | Owner |
|---|---|---|---|---|---|---|
| M-16 | Candidate data retention compliance | % of candidate personal data records deleted or pseudonymised in accordance with the agreed retention policy [ASSUMPTION A-013: retention periods not yet formally defined] | Automated check against data creation date and agreed retention period | Monthly | Any record exceeding the policy retention period triggers immediate deletion and documentation | CTO (acting DPO) |
| M-17 | Data access audit | Review of all access to candidate personal data in Scout — recruiter users, Sable AI Ltd internal staff, Anthropic sub-processor [ASSUMPTION A-005] — against the access control policy | Access log review against authorised access list | Quarterly | Any access event not matching the authorised access list triggers investigation | Engineering Lead |
| M-18 | DPIA currency | Whether L2-3.4-DPIA-Template-v1.md remains current following any material change to Scout's processing: new model version, new data field, new customer segment, new processing purpose | Manual review trigger on change events; annual calendar trigger otherwise | Annually; on every material change event | DPIA not reviewed within 12 months, or following any material change, triggers formal DPIA review | CTO (acting DPO) |
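[ASSUMPTION] An illustrative sketch of the M-16 automated retention check. The 180-day period is a placeholder only, since retention periods are not yet formally defined [ASSUMPTION A-013].

```python
from datetime import date, timedelta

# Placeholder retention period — actual periods not yet defined [ASSUMPTION A-013].
RETENTION_DAYS = 180


def overdue_records(records: list[tuple[str, date]],
                    today: date | None = None) -> list[str]:
    """M-16: return IDs of candidate records older than the retention period."""
    today = today or date.today()
    cutoff = today - timedelta(days=RETENTION_DAYS)
    return [record_id for record_id, created in records if created < cutoff]
```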
4.7 Design and Feature Audit Controls
| # | Metric | Definition | Measurement Method | Frequency | Alert Threshold | Owner |
|---|---|---|---|---|---|---|
| M-23 | Protected-characteristic filtering route review | Periodic review of Scout's UI filters, API parameters, admin tools, search operators, and export logic to verify that users cannot directly or indirectly exclude candidates by protected characteristic or proxy indicator (e.g., name-based ethnic filtering, graduation year range as age proxy, employment gap filtering as disability proxy) | Structured design review checklist covering UI, API query parameters, admin configuration options, and export fields; reviewed by Engineering Lead with sign-off from CTO | Annually as a minimum; additionally before any major UI release, new filter feature, or API parameter change | Any filtering route identified that enables exclusion by protected characteristic or proxy triggers immediate remediation and blocks the relevant release | Engineering Lead; CTO (sign-off) |
Regulatory basis: ICO AI in Recruitment Outcomes Report (November 2024): "features in some tools could lead to discrimination by having a search functionality that allowed recruiters to filter out candidates with certain protected characteristics." Equality Act 2010, s.19 (indirect discrimination by provision, criterion or practice). L2-3.2-Equality-Act-2010-Compliance-Map-v1.md (indirect discrimination mechanism analysis per characteristic).
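[ASSUMPTION] One way to support the M-23 review in code is a denylist check on exposed filter parameters, sketched below. The parameter names are hypothetical and the denylist is illustrative, not exhaustive; the structured design review checklist remains the authoritative control.

```python
# Illustrative, non-exhaustive denylist of proxy indicators that must never be
# exposed as UI filters, API query parameters, or export fields (M-23).
PROXY_FILTER_DENYLIST = {
    "graduation_year",  # age proxy
    "employment_gap",   # disability / pregnancy-maternity / caring proxy
    "first_name",       # name-based ethnicity proxy
}


def blocked_filter_params(requested_params: set[str]) -> set[str]:
    """Return any requested filter parameters appearing on the proxy denylist."""
    return requested_params & PROXY_FILTER_DENYLIST


# Example: a release exposing a graduation_year filter would be blocked.
assert blocked_filter_params({"skills", "graduation_year"}) == {"graduation_year"}
```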
4.8 Infrastructure and API Metrics
| # | Metric | Definition | Measurement Method | Frequency | Alert Threshold | Owner |
|---|---|---|---|---|---|---|
| M-19 | Anthropic API availability | % uptime of the Anthropic Claude API measured at Scout's integration point [ASSUMPTION A-005] | API health check from Scout application; Anthropic status page monitoring | Continuous; daily summary report | Any unplanned outage >30 minutes triggers escalation to Engineering Lead and Customer Success Lead | Engineering Lead |
| M-20 | API error rate | % of Anthropic API calls returning a 4xx or 5xx error response | Scout application error log | Continuous; daily summary | >1% error rate in any 1-hour window triggers investigation | Engineering Lead |
| M-21 | API response latency | Average time from Scout sending a CV processing request to receiving a structured output from the Anthropic Claude API | Scout application instrumentation | Continuous; daily summary | >10 seconds average latency or >30 second p99 latency in any hour triggers investigation [ASSUMPTION] | Engineering Lead |
| M-22 | Data transmission integrity | Confirmation that only extracted CV text and job description text — not raw document files or additional personal data fields — are transmitted to the Anthropic API [ASSUMPTION A-011] | Quarterly code-level data-flow audit; payload sampling review post each material code change | Quarterly | Any payload found to contain personal data beyond CV and JD text triggers immediate investigation and customer notification | Engineering Lead |
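[ASSUMPTION] An illustrative sketch of the M-20 sliding-window check at the Scout application layer, treating any 4xx or 5xx response as an error and alerting when the error rate exceeds 1% within any 1-hour window.

```python
from collections import deque
from datetime import datetime, timedelta


class ErrorRateMonitor:
    """M-20: sliding 1-hour window over API call outcomes; alert above 1%."""

    def __init__(self, window: timedelta = timedelta(hours=1),
                 threshold: float = 0.01):
        self.window = window
        self.threshold = threshold
        self.calls: deque[tuple[datetime, bool]] = deque()  # (timestamp, is_error)

    def record(self, timestamp: datetime, status_code: int) -> bool:
        """Record one API call; return True when the windowed rate breaches."""
        self.calls.append((timestamp, status_code >= 400))
        # Drop calls that have aged out of the window.
        while self.calls and self.calls[0][0] < timestamp - self.window:
            self.calls.popleft()
        errors = sum(1 for _, is_error in self.calls if is_error)
        return errors / len(self.calls) > self.threshold
```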
5. Dashboard Design
The ICO expects organisations to "report key performance indicators for accuracy and bias regularly to senior managers and key stakeholders" and to retain evidence of action taken. For an early-stage company [ASSUMPTION A-001], this means a lightweight but structured governance reporting process, not a complex enterprise BI platform.
5.1 Monthly KPI Report (internal)
A one-page structured monthly report covering the metrics above. Format: metric | current value | alert threshold | status (GREEN / AMBER / RED) | action required.
Distribution: Founder/CEO, CTO, Engineering Lead, Customer Success Lead.
Required content as a minimum:
| Metric | Notes |
|---|---|
| M-01 Shortlisting accuracy rate | Include sample size and methodology note |
| M-02 Human review compliance rate | Flag any breach instances by customer |
| M-03 Human override rate | Flag >25% instances |
| M-07 Candidate complaint rate | Include complaint type breakdown |
| M-10 Shortlisting rate stability | Trend chart (3-month rolling) |
| M-13 Adverse impact ratio | Where demographic data is available and permissible; note if unavailable |
| M-19 API availability | Monthly uptime percentage |
| Incident summary | Count and severity classification of any incidents per L3-4.3-Incident-Response-Plan-v1.md |
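[ASSUMPTION] A minimal sketch of the GREEN / AMBER / RED status assignment for report rows. The framework above defines RED (alert) thresholds only; the AMBER band shown is an illustrative placeholder to be set per metric.

```python
def rag_status(value: float, amber: float, red: float,
               higher_is_worse: bool = True) -> str:
    """Assign the GREEN / AMBER / RED status used in the monthly KPI report."""
    if not higher_is_worse:
        # Flip the comparison for metrics where lower values are worse,
        # e.g. M-13 adverse impact ratio.
        value, amber, red = -value, -amber, -red
    if value >= red:
        return "RED"
    if value >= amber:
        return "AMBER"
    return "GREEN"


# Example: an M-03 override rate of 27% against an illustrative AMBER band at
# 20% and the 25% investigation threshold from Section 4.1.
print(rag_status(27.0, amber=20.0, red=25.0))  # RED
```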
5.2 Automated Alerting
The following metrics must trigger automated alerts to the responsible owner immediately on threshold breach:
| Metric | Alert channel | Escalation recipient |
|---|---|---|
| M-02 Human review compliance | Real-time application alert | Customer Success Lead; CTO |
| M-11 Canary test failure | Pre-deployment automated block; Engineering alert | Engineering Lead; CTO |
| M-12 Output format compliance | Per-run application alert | Engineering Lead |
| M-19 API availability | Real-time monitoring alert | Engineering Lead; Customer Success Lead |
| M-20 API error rate | Real-time monitoring alert | Engineering Lead |
All other metrics are surfaced in the monthly KPI report. AMBER status requires a documented explanation. RED status requires a written action plan within 5 business days [ASSUMPTION].
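[ASSUMPTION] A minimal sketch of an immediate threshold-breach alert delivered via a Slack incoming webhook, consistent with the tooling recommendation in Section 5.3. The webhook URL is a placeholder supplied by the team's Slack workspace configuration.

```python
import json
import urllib.request

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/REPLACE-ME"  # placeholder


def send_threshold_alert(metric_id: str, current_value: str, owner: str) -> None:
    """Post an immediate threshold-breach alert (Section 5.2) to the channel."""
    payload = {
        "text": f"THRESHOLD BREACH: {metric_id} — current value {current_value}. "
                f"Escalation owner: {owner}."
    }
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)
```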
5.3 Tooling Recommendation
[ASSUMPTION] At an early stage, a full observability or BI platform is likely disproportionate. The recommended approach is:
- Audit log: Scout application audit log exports to a controlled-access shared store (engineering-managed)
- Monthly report: Produced by Engineering Lead from log export; reviewed by CTO; distributed to above recipients via internal document channel
- Automated alerts: Delivered via existing engineering monitoring tooling (e.g., application error monitoring with Slack or email webhook notifications)
- Escalation: Scale to a purpose-built analytics stack as the company grows, as customer volume increases, or as ICO audit risk increases
6. Monitoring Review Cadence
| Review type | Frequency | Trigger | Output | Owner |
|---|---|---|---|---|
| Monthly KPI report | Monthly | Calendar | Monthly KPI report (Section 5.1) | Engineering Lead (produce); CTO (review and sign off) |
| Bias monitoring report | Monthly where demographic data available; quarterly minimum | Calendar and post-model change | Adverse impact report per L3-4.2-Bias-Monitoring-Protocol-v1.md | CTO |
| Canary / accuracy test | Monthly and pre-deployment | Calendar and code/prompt/model change | Test pass/fail report; deployment gate decision | Engineering Lead |
| Data access audit | Quarterly | Calendar | Access review report | Engineering Lead |
| DPIA review | Annual; post-material change | Calendar and change event | Updated DPIA per L2-3.4-DPIA-Template-v1.md | CTO (acting DPO) |
| Full framework review | Annual | Calendar | Revised monitoring framework; updated thresholds | CTO |
| External bias audit | Annual minimum; post-significant bias finding | Calendar | External audit report; recommendations | CTO (to commission) |
| Feature and design audit (M-23) | Annual minimum; additionally before any major UI release or new filter / API parameter feature | Calendar and pre-release trigger | Design review checklist — confirmed no protected-characteristic filtering routes | Engineering Lead (conduct); CTO (sign-off) |
7. Escalation and Governance
7.1 Threshold breach response
Any metric breaching a RED threshold must be:
1. Logged in the Scout incident register (reference L3-4.3-Incident-Response-Plan-v1.md for severity classification)
2. Assigned a named owner and a remediation deadline (within 5 business days [ASSUMPTION])
3. Reported in the next monthly KPI report with root cause analysis and corrective action
7.2 Governance escalation
Where a RED threshold breach relates to any of the following, the matter must be escalated to the Founder/CEO in addition to the CTO:
- M-02 Human review compliance (Arts. 22A–22C safeguards breach)
- M-07 or M-08 Candidate complaints or rights requests
- M-13 Adverse impact ratio below 0.80 for any monitored group
7.3 External notification triggers
Monitoring findings may trigger external reporting obligations. The conditions and timelines for notifying the ICO under UK GDPR Art. 33 and for considering EHRC engagement are set out in L3-4.3-Incident-Response-Plan-v1.md.
7.4 Evidence retention
All monitoring reports, test results, alert records, and action plans must be retained for a minimum of 3 years [ASSUMPTION A-021]. The ICO expects organisations to be able to produce test results and remediation evidence on request.
8. Cross-References
| Document | Relevance |
|---|---|
| L3-4.2-Bias-Monitoring-Protocol-v1.md | Detailed methodology for M-13–M-15 bias proxy metrics; legal framework for demographic data collection; adverse impact ratio calculation; remediation process |
| L3-4.3-Incident-Response-Plan-v1.md | Incident definitions, severity levels P1–P3, external notification obligations triggered by monitoring findings |
| L1-2.5-Roles-and-Responsibilities-v1.md | RACI for monitoring responsibilities |
| L2-3.4-DPIA-Template-v1.md | DPIA review triggered by material changes detected through monitoring |
| L2-3.1-UK-GDPR-Mapping-Matrix-v1.md | UK GDPR obligations mapping including Arts. 22A–22D and Art. 5 |
| L2-3.2-Equality-Act-2010-Compliance-Map-v1.md | Equality Act discrimination risk context for bias metrics |
9. Regulatory Sources
| Source | Relevant provisions |
|---|---|
| UK GDPR | Art. 5(1)(a)–(e), Art. 5(2), Arts. 22A–22D, Art. 24, Art. 25, Art. 32(1)(d), Art. 33(5) |
| ICO AI in Recruitment Outcomes Report (November 2024) | Bias monitoring; human review; adverse impact ratio; evidence retention; senior reporting obligations |
| DSIT Responsible AI in Recruitment Guide (March 2024) | Model drift; ongoing monitoring; iterative bias audits; contestability |
| ICO Guide to Accountability and Governance | Documentation standards; review cycles; breach registers |
| ICO Personal Data Breaches Guidance | Breach documentation and recording requirements |