Why this standard exists. AI agents are making consequential decisions — approving loans, authorizing medical procedures, screening job candidates, flagging transactions. Regulators, courts, and customers are increasingly demanding accountability for these decisions. Yet no widely accepted standard exists for what "accountable" means in practice.
The SV-10 defines the minimum requirements for a production AI agent system to be considered accountable. It is not a certification. It is a checklist. It is designed to be used by compliance officers assessing readiness, engineering teams building systems, and executives making deployment decisions.
How to use it. Review each requirement. Assess whether your current system meets it. Where it does not, you have a gap. Gaps are not disqualifying — they are a roadmap. The goal is not perfection at launch. The goal is knowing where you stand before someone else finds out.
This standard is published under Creative Commons CC BY 4.0. You may use it, reproduce it, share it, and build on it — with attribution.
Section 1 — Decision Records
A decision record must capture, at minimum: the agent identity, the timestamp, the complete input data, the output and outcome, and the model or system version used. Records must be created automatically at the time of decision — not reconstructed afterward.
Rationale
Reconstructing a decision after the fact requires access to system state, model versions, and data that may no longer be available. Records created at decision time are the only reliable source of truth. "We can reproduce it" is not the same as "we recorded it."
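The minimum fields above can be sketched as a record type that is populated at decision time. This is a minimal illustration, not a prescribed schema — the class and field names (`DecisionRecord`, `record_decision`, and so on) are hypothetical:

```python
import json
import uuid
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class DecisionRecord:
    """Minimum fields from Section 1; names are illustrative."""
    agent_id: str
    model_version: str
    input_data: dict      # complete input, preserved as observed
    output: str
    outcome: str
    record_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def record_decision(agent_id, model_version, input_data, output, outcome):
    """Create the record at the moment of decision, never reconstructed later."""
    rec = DecisionRecord(agent_id, model_version, dict(input_data), output, outcome)
    return json.dumps(asdict(rec), sort_keys=True)

serialized = record_decision(
    "loan-agent-1", "model-2024-06", {"income": 52000}, "REJECTED", "rejected"
)
```

Freezing the dataclass and serializing immediately reflects the requirement that records capture the system's state at decision time, not a later reconstruction.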
A record that states only the outcome ("REJECTED") without the reasoning is insufficient for accountability purposes. Each factor that influenced the decision must be documented with the actual value observed — not a general description. The reasoning must be traceable to specific data points in the input record.
Rationale
GDPR Article 22 and emerging AI regulations require "meaningful information about the logic involved." A confidence score or a summary sentence does not meet this bar. Factor-level reasoning tied to actual values is the only form of explanation that can be independently verified against the input data.
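Factor-level reasoning tied to observed values can be sketched as follows. The rule structure and threshold values here are purely illustrative assumptions, not part of the standard:

```python
def explain_decision(input_record: dict, rules: list) -> list:
    """Document each influencing factor with the actual value observed.
    `rules` is a hypothetical list of (factor, predicate, description) tuples."""
    factors = []
    for factor, predicate, description in rules:
        observed = input_record.get(factor)
        if predicate(observed):
            factors.append({
                "factor": factor,
                "observed_value": observed,  # the actual value, not a summary
                "reason": description,
            })
    return factors

# Illustrative rules only — real thresholds belong to the deploying system.
rules = [
    ("debt_to_income", lambda v: v is not None and v > 0.43,
     "DTI above 0.43 threshold"),
    ("credit_score", lambda v: v is not None and v < 620,
     "credit score below 620 floor"),
]
reasoning = explain_decision({"debt_to_income": 0.51, "credit_score": 702}, rules)
```

Because each factor carries the observed value, a third party can verify the explanation against the preserved input record — the property the requirement demands.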
Every decision record must be signed or hashed at the time of creation such that any subsequent modification — to any field — is detectable by a third party. The signing mechanism must use an industry-standard algorithm. The verification key must be independently accessible. Records stored in mutable systems without cryptographic protection do not meet this requirement.
Rationale
A log that can be modified is not an audit trail — it is a file. Legal defensibility requires proof that a record represents what the system actually produced, not an edited version of it. Cryptographic signatures are the only mechanism that enables independent third-party verification.
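A minimal sketch of tamper evidence using HMAC-SHA256 from the standard library. Note the assumption: HMAC uses a shared key, so true third-party verification with an independently accessible key would use an asymmetric signature scheme (e.g. Ed25519) instead — this sketch shows only the sign-at-creation and detect-modification mechanics:

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"demo-key"  # illustrative; production would use an asymmetric key pair

def sign_record(record: dict) -> str:
    """Sign the canonical serialization at creation time."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(SIGNING_KEY, canonical, hashlib.sha256).hexdigest()

def verify_record(record: dict, signature: str) -> bool:
    """Any modification to any field changes the canonical form, so it fails."""
    return hmac.compare_digest(sign_record(record), signature)

rec = {"agent_id": "a1", "outcome": "REJECTED",
       "timestamp": "2024-06-01T12:00:00Z"}
sig = sign_record(rec)
tampered = dict(rec, outcome="APPROVED")
```

Canonical serialization (sorted keys, fixed separators) matters: without it, two byte-different encodings of the same record would produce different signatures.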
It is not sufficient to prove that individual records are unmodified. The system must also provide a mechanism to detect whether any records have been deleted. This requires a hash chain, sequence counter, or equivalent mechanism that makes gaps in the record set detectable.
Rationale
Selective deletion of unfavorable decisions is a form of evidence tampering. Protecting individual records from modification while allowing deletion of inconvenient ones provides false assurance. Completeness verification closes this gap.
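The hash-chain-plus-sequence-counter mechanism can be sketched as below. Entry layout and field names are illustrative assumptions:

```python
import hashlib
import json

def chain_records(records: list) -> list:
    """Link each record to its predecessor by hash and sequence number."""
    prev_hash = "0" * 64  # genesis value for the first entry
    chained = []
    for seq, rec in enumerate(records):
        body = json.dumps(rec, sort_keys=True)
        entry_hash = hashlib.sha256(
            f"{prev_hash}|{seq}|{body}".encode()
        ).hexdigest()
        chained.append({"seq": seq, "prev_hash": prev_hash,
                        "body": body, "hash": entry_hash})
        prev_hash = entry_hash
    return chained

def verify_chain(chained: list) -> bool:
    """Detect deletions (sequence gaps, broken links) and modifications."""
    prev_hash = "0" * 64
    for expected_seq, entry in enumerate(chained):
        if entry["seq"] != expected_seq or entry["prev_hash"] != prev_hash:
            return False  # a record was removed or reordered
        recomputed = hashlib.sha256(
            f"{prev_hash}|{entry['seq']}|{entry['body']}".encode()
        ).hexdigest()
        if recomputed != entry["hash"]:
            return False  # a record was modified
        prev_hash = recomputed
    return True

chain = chain_records([{"id": 1}, {"id": 2}, {"id": 3}])
```

Deleting the middle entry breaks both the sequence and the hash link, so the gap is detectable even though each surviving record is individually intact.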
Section 2 — Retention
For high-risk AI systems as defined by the EU AI Act, decision records must be retained for a minimum of 10 years. For financial services AI systems, applicable FINRA or SEC retention periods apply. For systems processing personal data, GDPR retention limitation principles apply. Organizations must document the applicable retention period and demonstrate that their current record coverage meets it.
Rationale
EU AI Act Article 12 makes 10-year retention a legal requirement for high-risk systems. Many teams begin logging only after a compliance question arises — at which point years of records are permanently unavailable. The retention clock starts at first deployment, not at first audit.
Section 3 — Behavioral Monitoring
Every production AI agent must have a documented behavioral baseline — the expected distribution of outcomes, approval rates, confidence levels, and decision volume under normal operating conditions. Deviations from the baseline must trigger investigation. The baseline must be updated when intentional system changes are made, and the history of baseline changes must be retained.
Rationale
Model providers update models without notice. Prompt drift occurs gradually. Data distributions shift. Without a documented baseline, organizations cannot demonstrate that their AI is operating as designed — a requirement for regulatory approval in multiple jurisdictions.
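A baseline check on one metric — approval rate — can be sketched as below. The metric choice, tolerance, and function names are illustrative assumptions; a real baseline would cover outcome distributions, confidence levels, and volume, as the requirement states:

```python
from statistics import mean

def baseline(outcomes: list) -> dict:
    """Hypothetical baseline: approval rate under normal operating conditions."""
    return {"approval_rate": mean(1.0 if o == "APPROVED" else 0.0
                                  for o in outcomes)}

def deviates(current: dict, base: dict, tolerance: float = 0.10) -> bool:
    """Trigger investigation when the rate drifts beyond tolerance."""
    return abs(current["approval_rate"] - base["approval_rate"]) > tolerance

base = baseline(["APPROVED"] * 70 + ["REJECTED"] * 30)   # documented baseline
today = baseline(["APPROVED"] * 45 + ["REJECTED"] * 55)  # observed today
```

The baseline dict itself would be versioned and retained whenever intentional system changes shift the expected distribution.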
The organization must maintain an active monitoring system that detects behavioral anomalies — unusual decision rates, outcome distribution shifts, confidence degradation, agent silence — and generates alerts to responsible parties. Finding out about a system anomaly from a customer complaint, news story, or regulatory inquiry is not acceptable. Alert history must be logged with timestamps and acknowledgment records.
Rationale
Demonstrating due diligence requires showing that anomalies were caught and addressed internally. An alert history is evidence of active oversight — the kind of evidence that distinguishes a negligent operator from a responsible one in regulatory and legal proceedings.
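An alert history with timestamps and acknowledgment records can be sketched as a small append-only log. Class and field names are illustrative:

```python
from datetime import datetime, timezone

class AlertLog:
    """Timestamped alerts with acknowledgment records (names illustrative)."""

    def __init__(self):
        self.alerts = []

    def raise_alert(self, kind: str, detail: str) -> int:
        self.alerts.append({
            "id": len(self.alerts),
            "kind": kind,
            "detail": detail,
            "raised_at": datetime.now(timezone.utc).isoformat(),
            "acknowledged_by": None,
        })
        return self.alerts[-1]["id"]

    def acknowledge(self, alert_id: int, who: str) -> None:
        self.alerts[alert_id]["acknowledged_by"] = who

    def unacknowledged(self) -> list:
        """Open alerts are the ones that distinguish detection from oversight."""
        return [a for a in self.alerts if a["acknowledged_by"] is None]

log = AlertLog()
aid = log.raise_alert("outcome_shift", "approval rate fell from 0.70 to 0.45")
```

The acknowledgment field is what turns an alert stream into evidence of active oversight: it records not just that the anomaly fired, but that a responsible party saw it.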
Section 4 — Traceability
When multiple AI agents contribute to a single decision — fraud screening, risk scoring, final approval — the full chain of agent decisions must be traceable as a unified sequence. It must be possible to reconstruct which agent made which decision, in what order, with what inputs, at what time. Isolated records from individual agents that cannot be linked to a shared workflow do not meet this requirement.
Rationale
Multi-agent architectures are increasingly common in production. When something goes wrong in a pipeline, accountability requires identifying which agent in the chain made the consequential decision. Individual agent logs that cannot be correlated are insufficient for this purpose.
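One common way to make isolated agent records linkable is a shared trace identifier plus a step counter. This is a sketch under that assumption — field names (`trace_id`, `step`) are illustrative, not mandated by the standard:

```python
import uuid

def start_workflow() -> str:
    """One shared trace id links every agent's record in the chain."""
    return str(uuid.uuid4())

def agent_record(trace_id: str, step: int, agent: str,
                 inputs: dict, decision: str) -> dict:
    return {"trace_id": trace_id, "step": step, "agent": agent,
            "inputs": inputs, "decision": decision}

def reconstruct(records: list, trace_id: str) -> list:
    """Reassemble the unified sequence: which agent decided what, in what order."""
    return sorted((r for r in records if r["trace_id"] == trace_id),
                  key=lambda r: r["step"])

tid = start_workflow()
records = [  # arrival order need not match decision order
    agent_record(tid, 2, "final-approval", {"risk": 0.8}, "REJECTED"),
    agent_record(tid, 0, "fraud-screen", {"txn": 1}, "PASS"),
    agent_record(tid, 1, "risk-score", {"txn": 1}, "0.8"),
]
chain = reconstruct(records, tid)
```

Because every record carries the same `trace_id`, an investigator can recover the full pipeline sequence even when each agent logs independently.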
The organization must be able to replay any recorded decision using the original input data to verify that the AI's reasoning is consistent and accurate. The original inputs must be preserved intact in the decision record. The replay mechanism must confirm whether the decision is consistent with the original outcome and flag discrepancies. This capability must be available for the full retention period.
Rationale
The ability to verify a past decision is distinct from the ability to record it. Replay enables investigation — when a decision is challenged, you can demonstrate what the system would produce given the same inputs, which is a powerful form of evidence in both regulatory and legal contexts.
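A replay check can be sketched as re-running the preserved inputs through the decision function and flagging discrepancies. Here `demo_policy` is a stand-in for the versioned model or policy originally used — replay is only meaningful against that same version:

```python
def replay(record: dict, decision_fn) -> dict:
    """Re-run the decision on preserved inputs and flag discrepancies.
    `decision_fn` stands in for the versioned model/policy used originally."""
    replayed = decision_fn(record["input_data"])
    return {
        "record_id": record["record_id"],
        "original": record["outcome"],
        "replayed": replayed,
        "consistent": replayed == record["outcome"],
    }

def demo_policy(inputs: dict) -> str:
    # Illustrative deterministic policy; real systems pin the model version.
    return "APPROVED" if inputs["score"] >= 650 else "REJECTED"

result = replay(
    {"record_id": "r1", "input_data": {"score": 610}, "outcome": "REJECTED"},
    demo_policy,
)
```

For non-deterministic models, "consistent" may need to mean "within a documented tolerance" rather than byte-equal — another reason the replay mechanism and its criteria should themselves be documented.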
Section 5 — Reporting and Response
The organization must be able to produce a complete audit report for any time window — covering all AI decisions made, the reasoning behind each, chain integrity verification, retention coverage, and behavioral monitoring status — within hours, not weeks. The report must be in a format acceptable to regulators and legal counsel. The engineering team must not be required to assemble it manually each time it is needed.
Rationale
Regulatory inquiries and legal discovery come with time pressure. An organization that requires weeks of engineering work to respond to an audit request is operationally non-compliant regardless of the quality of its underlying records. The ability to respond quickly is itself a compliance requirement under several frameworks.
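Producing a time-window report without manual assembly can be sketched as a query over the existing record store. Field names and the report shape are illustrative assumptions; ISO-8601 timestamps compare correctly as strings:

```python
def audit_report(records: list, start: str, end: str) -> dict:
    """Assemble an audit report for a time window directly from stored
    records — no per-request engineering work. Field names are illustrative."""
    window = [r for r in records if start <= r["timestamp"] <= end]
    return {
        "window": {"start": start, "end": end},
        "decision_count": len(window),
        "outcomes": {o: sum(1 for r in window if r["outcome"] == o)
                     for o in {r["outcome"] for r in window}},
        "records": window,  # full records, including reasoning, travel with it
    }

records = [
    {"timestamp": "2024-06-01T10:00:00Z", "outcome": "APPROVED"},
    {"timestamp": "2024-06-02T10:00:00Z", "outcome": "REJECTED"},
    {"timestamp": "2024-07-01T10:00:00Z", "outcome": "APPROVED"},
]
report = audit_report(records, "2024-06-01T00:00:00Z", "2024-06-30T23:59:59Z")
```

A full report under this standard would also attach chain integrity verification, retention coverage, and monitoring status — the point of the sketch is that the report is computed from records on demand, not assembled by hand.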