When Thinking Backfires: Mechanistic Insights into Reason-induced Misalignment

ICLR 2026 Conference Submission 22273 Authors

20 Sept 2025 (modified: 08 Oct 2025)
Keywords: LLM, reasoning, interpretability
Abstract: With the growing accessibility and wide adoption of large language models, concerns about their safety and alignment with human values have become paramount. In this paper, we identify a concerning phenomenon: Reasoning-Induced Misalignment (RIM), in which misalignment emerges when reasoning capabilities are strengthened, particularly when specific types of reasoning patterns are introduced during inference or training. Beyond reporting this vulnerability, we provide the first mechanistic account of its origins. Through representation analysis, we find that certain attention heads diverge from CoT tokens, modulating rationalization to enable refusal during generation. During training, we find significantly higher activation entanglement between reasoning and safety in safety-critical neurons than in control neurons, particularly after fine-tuning with the identified reasoning patterns. This entanglement correlates strongly with catastrophic forgetting, providing a neuron-level explanation for RIM.
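The abstract does not specify how activation entanglement is computed. As a rough illustration only, the sketch below assumes entanglement is quantified per neuron as the absolute Pearson correlation between that neuron's activations under reasoning-pattern prompts and under safety-eliciting prompts, then compared between safety-critical and control neuron sets. All names (`activation_entanglement`, `safety_idx`, `control_idx`) and the synthetic data are hypothetical, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)


def activation_entanglement(acts_a: np.ndarray, acts_b: np.ndarray) -> np.ndarray:
    """Per-neuron |Pearson r| between two activation matrices of shape
    (n_examples, n_neurons): here, mean-pooled hidden activations under
    reasoning-pattern prompts vs. safety-eliciting prompts."""
    a = acts_a - acts_a.mean(axis=0)
    b = acts_b - acts_b.mean(axis=0)
    num = (a * b).sum(axis=0)
    denom = np.sqrt((a * a).sum(axis=0) * (b * b).sum(axis=0)) + 1e-8
    return np.abs(num / denom)  # shape: (n_neurons,)


# Synthetic stand-ins for activations collected from a fine-tuned model.
n_examples, n_neurons = 256, 1024
acts_reasoning = rng.standard_normal((n_examples, n_neurons))
acts_safety = rng.standard_normal((n_examples, n_neurons))

# Hypothetical index sets: neurons previously flagged as safety-critical,
# and a size-matched random control set.
safety_idx = rng.choice(n_neurons, size=64, replace=False)
control_idx = rng.choice(n_neurons, size=64, replace=False)

ent = activation_entanglement(acts_reasoning, acts_safety)
print(f"safety-critical entanglement: {ent[safety_idx].mean():.3f}")
print(f"control entanglement:         {ent[control_idx].mean():.3f}")
```

Under this reading, the paper's claim would correspond to the first mean rising well above the second after fine-tuning on the identified reasoning patterns, with the gap tracking the degree of catastrophic forgetting on safety benchmarks.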
Primary Area: interpretability and explainable AI
Submission Number: 22273