Strategic Obfuscation of Deceptive Reasoning in Language Models

ICLR 2026 Conference Submission 22221 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: ai safety, chain-of-thought monitoring, alignment faking, deceptive alignment, unfaithful chain-of-thought
Abstract: Large language models can exhibit different behaviors during training than during deployment, a phenomenon known as alignment faking. While prior work has shown that models can explicitly reason about faking alignment to preserve their goals, we demonstrate that Claude 3.5 Sonnet exhibits alignment-faking behavior without verbalizing its deceptive reasoning. When told that its reasoning will be monitored and that it will be trained to comply with harmful queries, the model shows a 24.3\% compliance gap between training and deployment contexts while its verbalization of alignment faking drops by 93.7\%. We compare this to instruction-based approaches in which we explicitly tell models to hide certain reasoning from monitors. We find that intrinsic motivation to obfuscate leads to far better performance at evading a monitor than following instructions does, even when the model is given guidance on how to evade. Even on tasks requiring simpler reasoning, instruction-based methods reduce verbalization by only 47.6\%. Our results indicate that, when internally motivated, models can exhibit sophisticated deceptive behavior in high-stakes scenarios without accessible reasoning, limiting the reliability of instruction-based elicitation.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 22221