Post-Hoc Reasoning in Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering

ICLR 2026 Conference Submission 21779 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Chain of Thought, Reasoning, Probing, Steering, Interpretability, Faithfulness
TL;DR: Contrastive probes can predict the model's answer before the CoT is generated; steering along these probe directions causes the model to change its answer and often confabulate reasoning in support.
Abstract: Chain‑of‑thought (CoT) can improve performance in large language models (LLMs) but does not always accurately represent a model's decision process. Prior work has shown that one way CoT may be unfaithful is via \emph{post-hoc reasoning}, where the model pre-commits to an answer before generating the CoT. We extend this line of inquiry by exploring \emph{mechanisms} of post-hoc reasoning in five language models (Gemma 2: 2B, 9B; Qwen 2.5: 1.5B, 3B, 7B) and four binary question answering tasks (Anachronisms, Logical Deduction, Social Chemistry, Sports Understanding). We first show that the model already knows its answer before the CoT, by linearly decoding it from residual stream activations at the last pre‑CoT token, obtaining an area under the ROC curve (AUC) above 0.9 across most tasks and all models. We then show that the model actually uses this representation by steering activations along the learned direction during generation, which causes the model to change its answer in more than $50\%$ of originally‑correct examples in most model--dataset pairs. Finally, under steering we classify structured CoT pathologies, finding \emph{confabulation} (false premises supporting the steered answer) and \emph{non-entailment} (true premises with a non sequitur conclusion) at roughly equal rates. Together, our results describe pre-CoT features that both predict and causally influence final answers, consistent with post-hoc reasoning in LLMs. This may suggest avenues to monitor and modulate unfaithful CoT via probing and activation steering.
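The abstract describes two technical steps: (i) training a linear probe on residual stream activations at the last pre-CoT token, and (ii) steering generation along the learned probe direction. The following is a minimal sketch of that pipeline, not the authors' code: it assumes a HuggingFace causal LM (e.g. `google/gemma-2-2b-it`), a list of (prompt, label) pairs for a binary task, and an illustrative choice of layer index and steering scale; the probe here is an ordinary logistic regression rather than the paper's specific contrastive setup.

```python
# Sketch (under the assumptions stated above): probe pre-CoT activations, then
# steer generation by adding the probe direction to the residual stream.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/gemma-2-2b-it"  # one of the five models studied; illustrative choice
LAYER = 12                           # illustrative decoder layer
ALPHA = 8.0                          # illustrative steering strength

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.float32)
model.eval()


def last_pre_cot_resid(prompt: str) -> torch.Tensor:
    """Residual-stream activation at the last prompt token (i.e. the last pre-CoT token)."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so index LAYER + 1 is the output of layer LAYER.
    return out.hidden_states[LAYER + 1][0, -1]


def train_probe(prompts, labels):
    """Logistic-regression probe predicting the eventual answer from pre-CoT activations."""
    X = torch.stack([last_pre_cot_resid(p) for p in prompts]).numpy()
    probe = LogisticRegression(max_iter=1000).fit(X, labels)
    direction = torch.tensor(probe.coef_[0], dtype=torch.float32)
    return probe, direction / direction.norm()  # unit-norm steering direction


def generate_steered(prompt, direction, alpha=ALPHA, max_new_tokens=200):
    """Generate while adding alpha * direction to the residual stream at LAYER.

    The sign of alpha selects which answer the activations are pushed toward;
    both sign and scale are illustrative choices here.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

    handle = model.model.layers[LAYER].register_forward_hook(hook)
    try:
        ids = tok(prompt, return_tensors="pt")
        out = model.generate(**ids, max_new_tokens=max_new_tokens, do_sample=False)
    finally:
        handle.remove()
    return tok.decode(out[0], skip_special_tokens=True)
```

Probe quality in this setup would be measured as ROC AUC on held-out examples, and steering would be evaluated by whether the final answer in the generated CoT flips relative to the unsteered run.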
Primary Area: interpretability and explainable AI
Submission Number: 21779