Reasoning-Preserved Safety Alignment for Large Reasoning Models

ICLR 2026 Conference Submission 17834 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Language Models, Alignment, Safety, Reasoning
Abstract: Recent research revealed that the reasoning performance of large reasoning models (LRMs) is significantly degraded after safety alignment (i.e., fine-tuning on safety datasets). This phenomenon implies that safety alignment for LRMs has impaired the well-learned parameters crucial for reasoning capabilities. Thus, an interesting question arises: _can we protect the parameters crucial for reasoning from being interfered by safety alignment, thereby acquiring safety capabilities while maintaining the original reasoning capabilities of LRMs?_ Motivated by the recent finding that safety capabilities are associated with only a subset of the full parameter space, we propose a novel method that achieves _reasoning-preserved safety alignment_ for LRMs. It first identifies reasoning-critical parameters based on a Fisher Information Matrix, where each diagonal element represents the importance of the parameter to reasoning capabilities, and then freezes these parameters during fine-tuning on safety datasets. Experiments on multiple reasoning and safety benchmarks validate that our proposed method achieves strong safety performance while maintaining the original reasoning performance of LRMs. Our code is publicly available at [https://anonymous.4open.science/r/RPSA](https://anonymous.4open.science/r/RPSA).
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17834