Evaluating Control Protocols for Untrusted AI Agents

AAAI 2026 Workshop TrustAgent, Submission 21

Published: 20 Nov 2025, Last Modified: 09 Mar 2026. Venue: AAAI 2026 TrustAgent Workshop Poster. License: CC BY 4.0

Keywords: Alignment and Control, AI Safety, Autonomous Agents

TL;DR: We evaluate control protocols in SHADE-Arena and find that resampling and deferring on critical actions raise safety from 50% to 96%. Adaptive attacks reduce resampling safety to 17%, while critical-action deferral remains robust.

Abstract: As AI systems become more capable and widely deployed as agents, ensuring their safe operation becomes critical. AI control offers one approach to mitigating the risk from untrusted AI agents by monitoring their actions and intervening or auditing when necessary. Evaluating the safety of these protocols requires understanding both their effectiveness against current attacks and their robustness to adaptive adversaries. In this work, we systematically evaluate a range of control protocols in SHADE-Arena, a dataset of diverse agentic environments. First, we evaluate blue team protocols, including deferral to trusted models, resampling, and deferring on critical actions, against a default attack policy. We find that resampling for incrimination and deferring on critical actions perform best, increasing safety from 50% to 96%. We then iterate on red team strategies against these protocols and find that attack policies with additional affordances, such as knowledge of when resampling occurs or the ability to simulate monitors, can substantially improve attack success rates against our resampling strategy, decreasing safety to 17%. However, deferring on critical actions is highly robust to even our strongest red team strategies, demonstrating the importance of denying attack policies access to protocol internals.

Submission Number: 21