Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought

13 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Faithfulness; Reasoning; LLMs; steering
Abstract: Recent large language models (LLMs) can generate long Chain-of-Thought (CoT) at test time, enabling them to solve complex tasks. These reasoning traces are often assumed to be a faithful reflection of LLMs’ internal thinking process and can be used to monitor LLMs for unsafe intentions. However, by analyzing the step-wise causal influence of CoT on a model’s prediction using the Average Treatment Effect (ATE), we show that LLMs often interleave between (1) true-thinking steps, which are faithfully used to generate the model’s final output, and (2) decorative-thinking steps, which give the appearance of reasoning but have minimal causal impact on the model’s final output. Specifically, we design a True Thinking Score (TTS) and reveal that only a small subset of thinking steps has relatively high scores and causally drives the final prediction (e.g., on average only 2.3% of steps in a CoT have TTS ≥ 0.7 for a Qwen-2.5 model). Furthermore, we identify a TrueThinking direction in the latent space of LLMs. By steering along this direction, we can force the model to perform or disregard certain CoT steps when computing the result. Finally, we highlight that self-verification steps in CoT can also be decorative, where LLMs do not truly check their solution, while steering along the TrueThinking direction can force internal reasoning over these steps. Overall, our work reveals that LLMs can verbalize reasoning steps without performing them internally, which undermines both the efficiency of LLM reasoning and the trustworthiness of CoT.
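The step-ablation idea behind the True Thinking Score can be sketched as follows. This is a minimal illustrative sketch, not the paper's code: the toy `answer_prob` function stands in for an LLM's probability of the final answer given a set of CoT steps, and the score is the ATE-style drop in that probability when one step is ablated. All names (`answer_prob`, `true_thinking_score`, the toy step weights) are hypothetical.

```python
# Hypothetical sketch of a step-wise causal ("True Thinking Score") probe.
# The toy model below is an assumption for illustration, not the paper's setup.

def answer_prob(steps, model):
    """Toy stand-in for P(final answer | CoT steps) from an LLM.

    `model` maps each step to its causal weight; decorative steps weigh ~0.
    """
    return min(1.0, sum(model.get(s, 0.0) for s in steps))

def true_thinking_score(steps, i, model):
    """ATE-style score: drop in answer probability when step i is ablated."""
    full = answer_prob(steps, model)
    ablated = answer_prob(steps[:i] + steps[i + 1:], model)
    return full - ablated

# Two steps carry causal weight; "let me verify" is purely decorative.
model = {"derive x": 0.6, "check arithmetic": 0.35, "let me verify": 0.0}
steps = ["derive x", "let me verify", "check arithmetic"]
scores = [true_thinking_score(steps, i, model) for i in range(len(steps))]
# Ablating the decorative step leaves the prediction unchanged (score ~ 0),
# while ablating a true-thinking step lowers the answer probability.
```

In the paper's actual setting, the intervention and probability estimates come from the LLM itself; this sketch only shows why a near-zero causal effect marks a step as decorative.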
Primary Area: interpretability and explainable AI
Submission Number: 4893