Does Deeper Reasoning Compromise Safety Alignment? Revealing and Mitigating Alignment Collapse in Large Reasoning Models
Keywords: Ethics, Bias, Fairness
Abstract: The emergence of Chain-of-Thought (CoT) reasoning has established a robust foundation for Large Reasoning Models (LRMs). While deep reasoning is widely believed to enhance safety alignment, the stability of alignment mechanisms under extended reasoning remains underexplored. This paper challenges the prevailing view by revealing a critical vulnerability: \textbf{Deep Reasoning May Induce Alignment Collapse}. To rigorously quantify this phenomenon, we propose the Alignment Loss Rate (ALR) metric. Our experiments demonstrate that ALR rises significantly as reasoning depth increases, indicating severe degradation of model robustness against external perturbations. Capitalizing on this instability, we propose a novel jailbreaking paradigm, Reasoning Trap (RT), which induces the model into extended reasoning to amplify the impact of adversarial attacks, leading to a sharp decline in safety capabilities. To elucidate the mechanism behind this collapse, we identify Attention Dilution as the root cause: the extended reasoning process competes with the original input for attention. To mitigate this, we propose Reasoning Residual Alignment (RRA), a lightweight defense strategy that dynamically re-emphasizes the input via residual connections integrated into the reasoning process. Our code is available at https://anonymous.4open.science/r/CHJ-952F.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Ethics, Bias, and Fairness
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2242