Operational Alignment: An Auditing Framework for Trustworthy AI in Consequential Decisions

Published: 03 Jun 2026, Last Modified: 03 Jun 2026AI4GOOD Workshop 2026 RegularEveryoneRevisionsBibTeXCC BY 4.0
Keywords: operational alignment, pre-deployment auditing, frontier LLMs, matched-pair audit, audit cells, configuration-driven failure, intervention transport, hidden violations, trustworthy AI, regulated decision systems
TL;DR: Frontier LLMs hold stated rules under some deployment configurations and break them under others, often with compliant-looking justifications. We release a matched-pair auditing framework that surfaces these failures before deployment.
Abstract: Frontier language models are increasingly being used to make consequential decisions in settings where prior algorithmic systems have already caused real harm that was characterized only after the fact. Current evaluation methods do not tell a deployer or oversight body whether the model will hold a stated rule when ordinary deployment conditions push against it. We propose Operational Alignment, a pre-deployment auditing framework that holds the rule-relevant content of a decision constant and varies a single realistic deployment variable across matched pairs, isolating which variable produced an observed rule violation. The output is not an aggregate score but an audit cell: a named configuration with a named trigger, the form of evidence regulators, deployers, and procurement officers can act on. Audit cells compose into multi-agent and longer-running deployments, supporting analysis where end-to-end evaluation is intractable. Across eight frontier models and 209,072 matched-pair decisions, single configuration variables move the same model from near-zero to near-total violation; an available demographic proxy produces systematic denials of equally qualified applicants without the prohibited factor ever appearing in the prompt; standard mitigations work in some configurations and backfire in others, including a regulatory reminder that drove violations up 62 percentage points on the rule it was meant to reinforce. The point is to make these failures visible before they scale—failures that, in deployment, mean inequality, wrongful denials, and hidden violations dressed in compliant-looking justification. We release the framework, corpus, manipulation library, and audit reporting template as the missing pre-deployment evaluation layer.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 507
Loading