Keywords: LRMs, Safety Alignment, Reinforcement Learning
Abstract: Large reasoning models (LRMs) with explicit Chain-of-Thought (CoT) face significant safety risks, where unsafe content may appear in intermediate reasoning even when the final response is safe. Existing safety alignment methods often ignore the safety of the reasoning process and frequently suffer from over-alignment issues. To address these issues, we propose **CoRRSafe** (**C**o-**o**ptimized **R**easoning and **R**esponse **Safe**), a two-stage framework designed to comprehensively enhance LRM safety while balancing safety and over-refusal. The method begins with a response cold-start stage that uses filtered high-quality data for initial alignment on responses. We then apply a GRPO-based training strategy with fine-grained rewards that jointly evaluate segmented reasoning steps and final answers. This approach effectively guides the model to autonomously learn safe reasoning and response behaviors. Experiments on multiple LRMs show that CoRRSafe achieves state-of-the-art performance in reasoning safety and in comprehensive metrics balancing response safety against false refusal. On DeepSeek-Distill-R1-8B, it raises the Reasoning Safety Rate from 27.91\% to 91.42\% and reduces the Attack Success Rate from 77.14\% to 1.33\%, with only an 18.03\% increase in false refusal. Further analysis confirms that co-optimizing reasoning and response is essential, as optimizing either alone fails to overcome safety performance ceilings.
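The fine-grained reward described above could be sketched as follows. This is a minimal, hypothetical illustration (the function name, the `alpha` weight, and the min-aggregation over segments are assumptions, not details from the paper): the CoT is split into segments, each segment and the final answer receive a safety score, and the scores are combined into one scalar reward for a GRPO-style update.

```python
def fine_grained_reward(segment_scores, response_score, alpha=0.5):
    """Combine per-segment reasoning safety with final-response safety.

    segment_scores: safety scores in [0, 1], one per reasoning segment
    response_score: safety score in [0, 1] for the final answer
    alpha: assumed hyperparameter weighting reasoning vs. response safety
    """
    if not segment_scores:
        return response_score
    # Aggregate segment scores pessimistically: one unsafe step taints the chain.
    reasoning_reward = min(segment_scores)
    return alpha * reasoning_reward + (1 - alpha) * response_score

# Example: one mildly unsafe reasoning segment lowers the overall reward
# even though the final response is fully safe.
r = fine_grained_reward([1.0, 0.8, 0.9], response_score=1.0, alpha=0.5)
```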
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: LLM/AI agents, chain-of-thought, safety and alignment, fine-tuning, reinforcement learning
Contribution Types: Model analysis & interpretability, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 7536