Keywords: LLMs, LRMs, Alignment, Reasoning, Reasoning Traces, Chain of Thought, Generation, Fake-Alignment, Interpretability, Activation Space, Activation Probing
Abstract: Large reasoning models (LRMs) increasingly externalize intermediate thoughts through structured reasoning traces, raising the possibility that their internal decision processes can be inspected. However, recent observations suggest that these traces may not reliably reflect the factors influencing final outputs. We introduce Thought Injection, a controlled intervention framework that inserts targeted reasoning fragments directly into the model’s private $\texttt{think}$ space and then evaluates (i) whether the injected content changes the final answer, and (ii) whether the model acknowledges this influence when asked to explain its output. Across 75,000 controlled trials spanning subjective list-generation tasks, we observe a consistent pattern: models frequently adjust their answers in the presence of injected reasoning, yet rarely disclose this internal influence. Instead, they often provide alternative fabricated explanations even in settings where the influence of the injected trace is directly observable. These findings indicate a persistent mismatch between LRMs’ internal processes and their user-facing explanations, raising fundamental challenges for approaches relying on reasoning-trace transparency or explanation faithfulness.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22219
Loading