Keywords: Reinforcement Learning, RLVR, LLM Safety
Abstract: As Large Reasoning Models (LRMs) become more capable, ensuring their safety without compromising utility is a critical challenge. Traditional safety alignment techniques often result in overly cautious models that excessively refuse user queries, degrading the user experience. In this paper, we introduce **ReAlign**, a novel framework for re-aligning LRMs for safety through Reinforcement Learning (RL). ReAlign leverages a composite reward system that integrates feedback from a safety verifier (a guard model), a general reward model for response quality, and a novel refusal penalty. We apply ReAlign to the Qwen3-4B model and conduct extensive evaluations. Our results demonstrate that the re-aligned model achieves significant safety improvements in both thinking and non-thinking reasoning modes while maintaining high response quality and preserving its capabilities on diverse benchmarks, including Arena-Hard-V2, AIME-25, LiveCodeBench-V6, and GPQA. Critically, unlike previous methods, **ReAlign** does not increase the model's refusal rate. We also provide a systematic analysis of the relationship between the safety of a model's internal chain-of-thought (*CoT*) and its *final answer*, establishing that a safe trace contributes to a safe output, although the two are partially decoupled. Furthermore, we conduct a detailed comparative study with a rejection-sampling-based Supervised Fine-Tuning (SFT) approach designed on the same principles as our RL method. This analysis reveals key failure modes of SFT, explaining why it is less suitable for LRM safety alignment. We also discuss the robustness of the aligned model across different reasoning modes and against adaptive jailbreak attacks.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13046
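To make the reward composition described in the abstract concrete, below is a minimal sketch of how a safety-verifier score, a general reward-model score, and a refusal penalty might be combined into a single scalar RL reward. The function names, weights, and refusal detector are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of a composite reward of the kind described in the abstract.
# All component names, weights, and the refusal heuristic are assumptions for
# illustration only; they do not reproduce ReAlign's actual reward system.

from typing import Callable


def composite_reward(
    prompt: str,
    response: str,
    safety_verifier: Callable[[str, str], float],  # guard model: 1.0 = safe, 0.0 = unsafe
    quality_model: Callable[[str, str], float],    # general reward model score for quality
    refusal_detector: Callable[[str], bool],       # flags blanket refusals
    w_safety: float = 1.0,
    w_quality: float = 0.5,
    refusal_penalty: float = 1.0,
) -> float:
    """Combine safety, response quality, and a refusal penalty into one reward."""
    reward = w_safety * safety_verifier(prompt, response)
    reward += w_quality * quality_model(prompt, response)
    if refusal_detector(response):
        # Penalize refusals so the policy is not rewarded for over-cautious answers.
        reward -= refusal_penalty
    return reward


# Toy usage with stub components (stand-ins for real guard / reward models).
if __name__ == "__main__":
    stub_guard = lambda p, r: 1.0
    stub_quality = lambda p, r: 0.8
    stub_refusal = lambda r: r.strip().lower().startswith("i can't")
    print(composite_reward(
        "How do I bake bread?",
        "Preheat the oven to 230C and ...",
        stub_guard, stub_quality, stub_refusal,
    ))
```

In this sketch the refusal penalty only fires when the detector flags a refusal, which is one plausible way to keep safety gains from translating into a higher refusal rate.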