Chain-of-Thought Reasoning In The Wild Is Not Always Faithful

18 Sept 2025 (modified: 11 Feb 2026)Submitted to ICLR 2026EveryoneRevisionsBibTeXCC BY 4.0
Keywords: Chain-of-Thought, Faithfulness, AI Safety
TL;DR: We show that Chain-of-Thought reasoning is not always faithful is frontier models, in unbiased contexts
Abstract: Chain-of-Thought (CoT) reasoning has significantly advanced state-of-the-art AI capabilities. Despite broad use, recent studies indicate that, when faced with an explicit bias in their prompts, models often omit mentioning this bias in their output, revealing that this verbalized reasoning can sometimes give an incorrect picture of how models arrive at conclusions (unfaithfulness). In this work, we go further and show that unfaithful CoT can also occur on realistic, non-adversarial prompts without artificial bias. We find that when separately presented with the questions "Is X bigger than Y?" and "Is Y bigger than X?", models sometimes produce superficially coherent arguments to justify systematically answering Yes to both questions or No to both questions, despite such responses being logically contradictory. We present preliminary evidence that this is due to models' implicit biases towards Yes or No, thus labeling this unfaithfulness as Implicit Post-Hoc Rationalization. Our results reveal that several production models exhibit surprisingly high rates of post-hoc rationalization in our settings: GPT-4o-mini (13%) and Haiku 3.5 (7%). While frontier models are more faithful, especially thinking ones, none are entirely faithful: Gemini 2.5 Flash (2.17%), ChatGPT-4o (0.49%), DeepSeek R1 (0.37%), Gemini 2.5 Pro (0.14%), and Sonnet 3.7 with thinking (0.04%). We also investigate Unfaithful Illogical Shortcuts, where models use subtly illogical reasoning to try to make a speculative answer to hard maths problems seem rigorously proven. Our findings raise challenges for strategies that aim to detect undesired behavior in LLMs via the chain of thought. More broadly, they indicate that while CoT reasoning can be a useful tool for assessing model outputs, it is not a complete and transparent account of a model's internal reasoning process, and should be used with caution, especially in agentic or safety-critical settings.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 12192
Loading