Keywords: LRMs, Safety Alignment, Reinforcement Learning
Abstract: Large reasoning models (LRMs) with explicit Chain-of-Thought (CoT) face significant safety risks, where unsafe content may appear in intermediate reasoning even when the final response is safe. Existing safety alignment methods often ignore the safety of the reasoning process and frequently suffer from over-alignment issues. To address these issues, we propose **CoRRSafe** (**C**o-**o**ptimized **R**easoning and **R**esponse **Safe**), a two-stage framework designed to comprehensively enhance LRM safety while balancing safety and over-refusal. The method begins with a response cold-start stage that uses filtered high-quality data for initial alignment on responses. We then apply a GRPO-based training strategy with fine-grained rewards that jointly evaluate segmented reasoning steps and final answers. This approach effectively guides the model to autonomously learn safe reasoning and response behaviors. Experiments on multiple LRMs show that CoRRSafe achieves state-of-the-art performance in reasoning safety and in comprehensive metrics balancing response safety against false refusal. On DeepSeek-Distill-R1-8B, it raises the Reasoning Safety Rate from 27.91\% to 91.42\% and reduces the Attack Success Rate from 77.14\% to 1.33\%, with only an 18.03\% increase in false refusal. Further analysis confirms that co-optimizing reasoning and response is essential, as optimizing either alone fails to overcome safety performance ceilings.
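The fine-grained reward described above could be sketched as follows. This is a minimal, hypothetical illustration (the function name, the `alpha` weight, and the min-aggregation over segments are assumptions, not details from the paper): the CoT is split into segments, each segment and the final answer receive a safety score, and the scores are combined into one scalar reward for a GRPO-style update.

```python
def fine_grained_reward(segment_scores, response_score, alpha=0.5):
    """Combine per-segment reasoning safety with final-response safety.

    segment_scores: safety scores in [0, 1], one per reasoning segment
    response_score: safety score in [0, 1] for the final answer
    alpha: assumed hyperparameter weighting reasoning vs. response safety
    """
    if not segment_scores:
        return response_score
    # Aggregate segment scores pessimistically: one unsafe step taints the chain.
    reasoning_reward = min(segment_scores)
    return alpha * reasoning_reward + (1 - alpha) * response_score

# Example: one mildly unsafe reasoning segment lowers the overall reward
# even though the final response is fully safe.
r = fine_grained_reward([1.0, 0.8, 0.9], response_score=1.0, alpha=0.5)
```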
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: LLM/AI agents, chain-of-thought, safety and alignment, fine-tuning, reinforcement learning
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7536