Does Deeper Reasoning Compromise Safety Alignment? Revealing and Mitigating Alignment Collapse in Large Reasoning Models
Keywords: Ethics, Bias, Fairness
Abstract: The emergence of Chain-of-Thought (CoT) reasoning has established a robust foundation for Large Reasoning Models (LRMs). While deep reasoning is widely believed to enhance safety alignment, the stability of alignment mechanisms under extended reasoning remains underexplored. This paper challenges the prevailing view by revealing a critical vulnerability: \textbf{Deep Reasoning May Induce Alignment Collapse}. To rigorously quantify this phenomenon, we propose the Alignment Loss Rate (ALR) metric. Our experiments demonstrate that ALR rises significantly as reasoning depth increases, indicating severe degradation of model robustness against external perturbations. Capitalizing on this instability, we propose a novel jailbreaking paradigm, Reasoning Trap (RT), which induces the model into extended reasoning to amplify the impact of adversarial attacks, leading to a sharp decline in safety capabilities. To elucidate the mechanism behind this collapse, we identify Attention Dilution as the root cause: the extended reasoning process competes with the original input for attention. To mitigate this, we propose Reasoning Residual Alignment (RRA), a lightweight defense strategy that dynamically re-emphasizes the input via residual connections integrated into the reasoning process. Our code is available at https://anonymous.4open.science/r/CHJ-952F.
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Ethics, Bias, and Fairness
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2242