Measuring Control Intervention Awareness Across Frontier LLMs

Published: 28 Feb 2026, Last Modified: 04 Apr 2026 · CAO Oral · CC BY 4.0
Keywords: ai control, model monitoring, ai safety
TL;DR: We measure language models' ability to detect control interventions on their outputs, finding most fail without guidance but Claude Sonnet 4 and GPT-5.2 succeed.
Abstract: AI control protocols provide a framework for overseeing untrusted models by intervening on potentially malicious actions: editing, resampling, or replacing them. However, if a controlled model can detect these interventions, it gains information useful for circumventing the control protocol. This work introduces the concept of \textit{control intervention awareness} (CIA): the capability of language models to detect control interventions in their trajectories. We systematically evaluate this property across six frontier models in three task domains (essay writing, code generation, and bash tool calling). Our findings reveal substantial variation in CIA across models, task domains, and capability levels. Most frontier models achieve high detection accuracy in the essay writing and code domains when explicitly instructed to watermark their outputs. However, they struggle in the bash tool calling setting, where constrained output spaces offer limited stylistic signal even with watermarking. Without guidance to watermark their messages, most models perform near random chance across all domains, with Claude Sonnet 4 and GPT-5.2 as notable exceptions, achieving substantially above-chance detection even at the hardest difficulty level. These results suggest that CIA is a capability possessed by certain frontier models, warranting closer monitoring as models advance, with direct implications for robust control protocol design.
Submission Number: 52