Beyond Refusals: Fine-grained Safety Alignment for Reasoning LLMs

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: safety alignment, large reasoning model, over-refusal
Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks, yet achieving robust safety alignment remains a significant challenge. Supervised fine-tuning (SFT) with safety data is a widely used approach to improving model safety; however, we identify that current SFT-based safety alignment methods often induce a phenomenon we term $\textbf{Shortcut Alignment}$: the model learns to recognize patterns in harmful inputs and emit templated refusals (e.g., "I'm sorry...") while decoupling its final response from its internal chain-of-thought (CoT) reasoning. This superficiality causes two critical problems: (i) refusals without reasoning carry no informative value, and (ii) models become overly cautious, producing excessive false refusals on benign queries and thereby degrading their general helpfulness. To understand this behavior, we formalize it through the lens of conditional mutual information (CMI), hypothesizing that when the information gain from the CoT is low, such shortcuts become low-resistance solutions that reduce training loss at little cost. We empirically verify this hypothesis via probe experiments that estimate the gap between predictions with and without the CoT on harmful versus benign data. Motivated by these insights, we propose Deep Instruct Fine-tuning (DIFT), which uses a $\textbf{CMI-Loss}$ that explicitly penalizes shortcut predictions while preserving standard instruction tuning on benign examples. Through theoretical analysis and empirical evidence, we show that our method alleviates erroneous refusals while preserving safety. Our work bridges theory and practice, offering the first fine-grained alignment method that explicitly targets shortcut alignment in LRMs.
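The abstract's CMI probe can be pictured as a log-likelihood gap: how much more probable the gold response becomes when the CoT is in context. The exact form of the paper's CMI-Loss is not given on this page, so the sketch below is purely illustrative; the names `probe_gap`, `cmi_penalty`, `dift_loss`, and the hinge/margin form of the penalty are assumptions, not the authors' definitions.

```python
import math

def probe_gap(logp_with_cot, logp_without_cot):
    # CMI proxy (assumed form): information the CoT contributes to the
    # final answer, estimated as the log-likelihood gap of the gold
    # response with vs. without the chain-of-thought in context.
    return logp_with_cot - logp_without_cot

def cmi_penalty(logp_with_cot, logp_without_cot, margin=0.5):
    # Hypothetical hinge-style penalty: charge examples whose CoT adds
    # less than `margin` nats of information, i.e. likely shortcut
    # predictions that ignore the reasoning trace.
    return max(0.0, margin - probe_gap(logp_with_cot, logp_without_cot))

def dift_loss(ce_loss, logp_with_cot, logp_without_cot, lam=1.0, margin=0.5):
    # Illustrative total loss: standard cross-entropy on the response
    # plus the shortcut penalty, weighted by `lam`.
    return ce_loss + lam * cmi_penalty(logp_with_cot, logp_without_cot, margin)

# A CoT-grounded prediction (large gap) incurs no penalty;
# a shortcut prediction (zero gap) is penalized by the full margin.
grounded = dift_loss(1.0, logp_with_cot=-1.0, logp_without_cot=-3.0)
shortcut = dift_loss(1.0, logp_with_cot=-2.0, logp_without_cot=-2.0)
```

On benign examples the gap is expected to be small by nature, so in practice such a penalty would presumably be applied only to safety-relevant data, consistent with the abstract's claim of preserving standard instruction tuning on benign examples.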
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9146