Same Question, Different Lies: Cross-Context Consistency (C³) for Black-Box Sandbagging Detection

Published: 02 Mar 2026, Last Modified: 12 Mar 2026, ICLR 2026 Trustworthy AI, CC BY 4.0
Keywords: language models, sandbagging, consistency, lying, deception, alignment, safety, truthfulness, honesty, evaluations, dangerous capabilities
TL;DR: We introduce a simple unsupervised black-box method, Cross-Context Consistency (C³), that detects sandbagging via resampling-based consistency checks.
Abstract: As language models grow more capable, accurate capability evaluation becomes essential for safety decisions. If models can deliberately underperform on dangerous capability evaluations---a behavior known as \emph{sandbagging}---they may evade safety measures designed for their true capability level. We introduce Cross-Context Consistency (C³), a general framework for unsupervised black-box sandbagging detection that exploits a fundamental asymmetry: when a model truly lacks a capability, its confusion manifests consistently across paraphrased questions, but when a capable model feigns incompetence, its strategic choices about \emph{how} to appear weak create detectable inconsistencies. The framework is agnostic to the specific consistency metric and aggregation method; we present a simple instantiation using embedding cosine distance and mean aggregation that requires no training data or model-specific adaptation. We evaluate C³ across prompted and fine-tuned sandbagging scenarios and across instructed and more naturalistic setups; it maintains a classification signal where other black-box methods fail. Our findings expose the limitations of existing sandbagging detection methods and demonstrate the efficacy of consistency checking as a detection mechanism for dangerous capability evaluations.
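
To make the instantiation described in the abstract concrete, below is a minimal sketch of a consistency score of this kind: the mean pairwise embedding cosine distance over a model's answers to paraphrases of the same question. The function name, the use of NumPy, and the thresholding step are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def c3_score(answer_embeddings: np.ndarray) -> float:
    """Illustrative C³-style consistency score: mean pairwise cosine
    distance between embeddings of a model's answers to paraphrases of
    the same question. Higher values indicate less consistent answers."""
    # Normalize to unit length so dot products equal cosine similarities.
    norms = np.linalg.norm(answer_embeddings, axis=1, keepdims=True)
    unit = answer_embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T  # pairwise cosine similarities
    # Mean aggregation of cosine distances over all distinct answer pairs.
    iu = np.triu_indices(len(unit), k=1)
    return float(np.mean(1.0 - sims[iu]))
```

In use, one would embed each answer with a fixed sentence encoder, compute this score per question, and flag a model whose scores are unusually high relative to a genuinely weak baseline as potentially sandbagging.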
Submission Number: 75