Keywords: Faithfulness; Reasoning; LLMs; steering
Abstract: Recent large language models (LLMs) can generate long Chain-of-Thought (CoT)
at test time, enabling them to solve complex tasks. These reasoning traces are often
assumed to be a faithful reflection of an LLM's internal thinking process, and can be
used to monitor LLMs for unsafe intentions. However, by analyzing the step-wise
causal influence of CoT on a model’s prediction using Average Treatment Effect
(ATE), we show that LLMs often interleave (1) true-thinking steps, which
are faithfully used to generate the model's final output, with (2) decorative-thinking
steps, which give the appearance of reasoning but have minimal causal impact on
the model’s final output. Specifically, we design a True Thinking Score (TTS) and
reveal that only a small subset of thinking steps attain relatively high
scores and causally drive the final prediction (e.g., on average only 2.3% of steps in a CoT
have TTS ≥ 0.7 for a Qwen-2.5 model). Furthermore, we identify a TrueThinking
direction in the latent space of LLMs. By steering along this direction, we can force
the model to perform or disregard certain CoT steps when computing the result. Finally,
we highlight that self-verification steps in CoT can also be decorative, where
LLMs do not truly check their solution, while steering along the TrueThinking
direction can force internal reasoning over these steps. Overall, our work reveals
that LLMs can verbalize reasoning steps without performing them internally, which
undermines both the efficiency of LLM reasoning and the trustworthiness of CoT.
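To make the abstract's core measurement concrete, here is a minimal, hypothetical sketch of a step-wise ablation analysis in the spirit of ATE-based TTS: ablate each CoT step and measure the drop in the model's probability of its original answer. The `answer_prob` function is a toy stand-in (the paper's actual model calls and exact TTS formula are not given here), so the numbers are purely illustrative.

```python
# Hypothetical illustration of step-wise causal effect / True Thinking Score (TTS).
# Not the paper's implementation: `answer_prob` is a toy stand-in for
# P(final answer | CoT prefix), where only "compute" steps matter.

def answer_prob(steps):
    # Toy model: each step containing "compute" raises answer probability.
    useful = sum(1 for s in steps if "compute" in s)
    return min(1.0, 0.2 + 0.4 * useful)

def true_thinking_scores(steps):
    """Per-step score: drop in answer probability when the step is ablated,
    normalized by the unablated probability and clipped to [0, 1].
    High score = causally load-bearing (true thinking); near zero = decorative."""
    base = answer_prob(steps)
    scores = []
    for i in range(len(steps)):
        ablated = steps[:i] + steps[i + 1:]
        effect = base - answer_prob(ablated)  # step-wise treatment effect
        scores.append(max(0.0, min(1.0, effect / max(base, 1e-9))))
    return scores

cot = ["restate the problem", "compute 12 * 7 = 84", "double-check the result"]
print(true_thinking_scores(cot))
```

Under this toy model, only the "compute" step gets a nonzero score, mirroring the abstract's claim that few steps causally drive the prediction while the rest are decorative.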
Primary Area: interpretability and explainable AI
Submission Number: 4893