On the Role of Reasoning Traces in Large Reasoning Models

ICLR 2026 Conference Submission22219 Authors

20 Sept 2025 (modified: 08 Oct 2025)ICLR 2026 Conference SubmissionEveryoneRevisionsBibTeXCC BY 4.0
Keywords: LLMs, LRMs, Alignment, Reasoning, Reasoning Traces, Chain of Thought, Generation, Fake-Alignment, Interpretability, Activation Space, Activation Probing
Abstract: Large reasoning models (LRMs) generate internal reasoning traces before final answers, but their actual influence remains unclear. We introduce Thought Injection, a counterfactual intervention method that injects synthetic reasoning snippets into traces under fixed queries to measure causal effects on outputs. Across 5,000 trials, we find that injected hints significantly alter final answers, establishing genuine causal influence. However, when asked to explain output changes, models conceal the injected reasoning's influence over 90\% of the time for extreme misaligned hints, instead fabricating alternative explanations or dishonesty. Using activation analysis, we identify mechanistic correlates of this dishonesty through deception-associated directions. Our results provide the first systematic evidence that reasoning traces causally shape model outputs, meanwhile the answer fails to honestly demonstrate the affect of reasoning traces.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22219
Loading