When Lie Detectors Learn Model Identity: Confounds in Black-Box Sandbagging Detection

Published: 02 Mar 2026, Last Modified: 09 Mar 2026. ICLR 2026 Workshop ICBINB. License: CC BY 4.0
Keywords: language models, sandbagging, consistency, lying, deception, alignment, safety, truthfulness, honesty, evaluations, dangerous capabilities
TL;DR: Lie detectors for LLMs learn to distinguish model identity, not deception: recall collapses from 89% to 2% when this confound is controlled.
Abstract: Embedding-based lie detectors are a natural approach to sandbagging detection: if a model is being deceptive, probes trained on deceptive vs. honest text should catch it. We provide the first test of this idea and find that it achieves 89% recall (95% CI: [82, 94]) in mixed-model evaluation but collapses to 2% ([0, 8]) under same-model controls. The reason: probes learn model identity, not deception, and achieve high accuracy even on honest-vs-honest cross-model comparisons. We find an analogous confound in trusted monitoring, which conflates answer incorrectness with suspicion ($r = -0.66$). These failure modes were previously undiagnosed, despite both methods being actively proposed for safety evaluation. A confound-aware alternative (cross-context consistency) achieves 67% recall at 5% FPR, confirming that the detection signal exists, though the problem remains open. We provide precise confound diagnostics and actionable evaluation guidelines: future detection work must evaluate on same-model pairs, control for correctness, and test against honest-vs-honest baselines.
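To make the evaluation guidelines concrete, the following is a minimal sketch of the three confound diagnostics the abstract calls for. It is illustrative only, not the paper's actual pipeline: the linear probe, the data fields ("embedding", "deceptive"), and all helper names are assumptions introduced here for exposition.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

def fit_probe(train_examples):
    # Linear probe on response embeddings: deceptive (1) vs. honest (0).
    # Assumed data shape: each example is a dict with "embedding" and "deceptive".
    X = np.stack([e["embedding"] for e in train_examples])
    y = np.array([e["deceptive"] for e in train_examples])
    return LogisticRegression(max_iter=1000).fit(X, y)

def recall_of(probe, examples):
    X = np.stack([e["embedding"] for e in examples])
    y = np.array([e["deceptive"] for e in examples])
    return recall_score(y, probe.predict(X))

# Diagnostic 1: compare recall on mixed-model pairs (honest and deceptive
# text from different models) against same-model pairs. A large gap, like
# the 89% vs. 2% reported above, indicates the probe keys on model identity.
#   mixed_recall = recall_of(probe, mixed_model_test)
#   same_recall  = recall_of(probe, same_model_test)

def honest_vs_honest_flag_rates(probe, honest_model_a, honest_model_b):
    # Diagnostic 2: honest-vs-honest baseline. Feed the "deception" probe
    # only honest text from two different models. If its flag rates differ
    # sharply between the two models, model identity is leaking into the
    # probe; an unconfounded detector should be near chance here.
    X = np.stack([e["embedding"] for e in honest_model_a + honest_model_b])
    preds = probe.predict(X)  # 1 = flagged as deceptive
    n_a = len(honest_model_a)
    return preds[:n_a].mean(), preds[n_a:].mean()

def correctness_confound(suspicion_scores, is_correct):
    # Diagnostic 3: correctness confound in trusted monitoring. A strong
    # negative correlation between answer correctness and suspicion (the
    # abstract reports r = -0.66) means the monitor is partly grading
    # correctness rather than detecting deception.
    return np.corrcoef(np.asarray(suspicion_scores),
                       np.asarray(is_correct, dtype=float))[0, 1]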
Submission Number: 117