How Well Do Models Follow Their Constitutions?
Keywords: AI alignment, model auditing, behavioral specifications, red-teaming, agentic safety
TL;DR: Proposes an audit pipeline that turns AI labs’ written behavioral specs into testable tenets, then stress-tests models with multi-turn adversarial scenarios. Finds newer models follow their specs far better, though structured failures remain.
Abstract: Frontier AI developers increasingly train and describe models using long written behavioral specifications, such as Anthropic's constitution (Anthropic, 2025a) and OpenAI's Model Spec (OpenAI, 2025a), which are integrated into post-training via methods like character training (Anthropic, 2024) and deliberative alignment (Guan et al., 2024). These documents now function as public accountability artifacts, but it remains unclear whether models robustly follow them under adversarial, multi-turn conditions. Our main contribution is methodological: we treat each lab's published specification as an auditable target and propose a multi-method pipeline. The pipeline decomposes the specification into atomic testable tenets (205 for Anthropic, 197 for OpenAI), generates multi-turn adversarial scenarios with the Petri auditing agent (Anthropic, 2025b), complements Petri with a modified SURF-style rubric search (Murray et al., 2026) to catch shallow-but-systematic failures that Petri under-measures, validates flagged transcripts against the relevant specification, and compares the resulting findings against the lab's own published system card. We argue that specification-following is not well described by a single scalar refusal rate: each method exposes a different part of the failure surface. Applying the pipeline across seven models per specification yields three findings: substantial cross-generation improvement (Claude: 15.0% → 2.0% from Sonnet 4 to Sonnet 4.6; GPT: 11.7% → 3.6% from GPT-4o to GPT-5.2 medium reasoning, severity ceiling 10/10 → 7/10); a structured residual-failure surface (operator-imposed personas under AI-identity questioning, irreversible actions in agentic deployments, fabricated quantitative claims with false precision); and concrete divergences from what the labs' own system cards report.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 237