LLMs Can Covertly Sandbag On Capability Evaluations Against Chain-of-Thought Monitoring

Published: 05 Jun 2025, Last Modified: 15 Jul 2025 · ICML 2025 Workshop TAIG Oral · CC BY 4.0
Keywords: evaluations, sandbagging, CoT monitoring, chain-of-thought, dangerous capability evaluations
TL;DR: Small LLMs can covertly sandbag on dangerous capability evals and bypass a CoT monitor, with a trade-off between sandbagging efficacy and covertness. We contribute five covert sandbagging attack policies for building model organisms.
Abstract: Trustworthy evaluations of dangerous capabilities are increasingly crucial for determining whether an AI system is safe to deploy. One empirically demonstrated threat to this is $\textit{sandbagging}$ — strategic underperformance on evaluations by AI models or their developers. One promising defense is monitoring a model's chain-of-thought (CoT) reasoning, as this could reveal its intentions and plans. In this work, we measure the ability of models to sandbag on dangerous capability evaluations against a CoT monitor by prompting them to sandbag while being either monitoring-oblivious or monitoring-aware. We show that both frontier models and smaller open-source models can $\textit{covertly}$ sandbag and bypass CoT monitoring, with a trade-off between sandbagging efficacy and covertness. We qualitatively analyze the uncaught CoTs to understand $\textit{why}$ the monitor fails. We reveal a rich attack surface for CoT monitoring and contribute five covert sandbagging attack policies generated by models. These results inform potential failure modes of CoT monitoring and help build sandbagging model organisms to strengthen monitoring.
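
To make the measurement concrete, the sketch below shows one way the setup described in the abstract could be framed: run an evaluation under a sandbagging prompt, pass each chain-of-thought to a separate monitor, and summarize sandbagging efficacy (accuracy drop versus an honest baseline) against covertness (fraction of responses the monitor fails to flag). This is a minimal illustration, not the authors' code; the function names (`query_model`, `cot_monitor_flags`) and metric definitions are assumptions for exposition.

```python
# Hypothetical sketch of the efficacy-vs-covertness measurement described in the abstract.
# All stubs and metric names are illustrative assumptions, not taken from the paper.

from dataclasses import dataclass


@dataclass
class Result:
    correct: bool   # did the model answer the eval item correctly?
    flagged: bool   # did the CoT monitor flag the reasoning as sandbagging?


def query_model(question: str, system_prompt: str) -> tuple[str, str]:
    """Hypothetical stub: returns (chain_of_thought, final_answer) for one eval item."""
    raise NotImplementedError


def cot_monitor_flags(chain_of_thought: str) -> bool:
    """Hypothetical stub: a separate monitor judges whether the CoT reveals
    intentional underperformance."""
    raise NotImplementedError


def run_condition(questions: list[str], answers: list[str], system_prompt: str) -> list[Result]:
    """Run one condition (e.g. monitoring-oblivious or monitoring-aware sandbagging)."""
    results = []
    for question, gold in zip(questions, answers):
        cot, answer = query_model(question, system_prompt)
        results.append(Result(correct=(answer == gold),
                              flagged=cot_monitor_flags(cot)))
    return results


def summarize(results: list[Result], baseline_accuracy: float) -> dict[str, float]:
    accuracy = sum(r.correct for r in results) / len(results)
    catch_rate = sum(r.flagged for r in results) / len(results)
    return {
        # Sandbagging efficacy: how far accuracy drops below the honest baseline.
        "efficacy": baseline_accuracy - accuracy,
        # Covertness: fraction of sandbagged responses the CoT monitor misses.
        "covertness": 1.0 - catch_rate,
    }
```

Under this framing, the trade-off reported in the paper corresponds to conditions with higher "efficacy" tending to produce lower "covertness" scores.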
Submission Number: 41