LoRA Align Is All You Need: Improving Reasoning Model Harmlessness with Safety CoT

ACL ARR 2025 February Submission 4568 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Reasoning models such as DeepSeek-R1 excel at mathematics, logic, and code generation. However, their enhanced capabilities also introduce safety risks: reasoning models that use Chain of Thought (CoT) are more likely to generate harmful content. Existing alignment methods (e.g., RLHF, SafeAligner, and SFT) focus primarily on the safety of the text generated by LLMs and fail to address the potential risks in the reasoning process itself, particularly those associated with CoT. To address this, we propose SCoT-LoRA Align, which consists of two phases: SCoT Alignment and SCoT-LoRA Alignment. SCoT Alignment is a framework that secures the reasoning process with a Safety-focused Chain of Thought (SCoT) through two-stage training: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL). While SCoT Alignment improves alignment, its focus on safety limits generation ability and efficiency, because the length of the SCoT distracts the model and adds computational overhead. Building on this, we further introduce SCoT-LoRA, a test-time alignment mechanism that converts the SCoT into low-rank parameters for dynamic model patching. It activates full SCoT analysis only when facing novel attacks, preserving alignment while minimizing the impact on generation ability and efficiency. Our method achieves 43.2% higher defense capability than baseline methods, with lower training costs and a negligible alignment tax, validated across six models and five jailbreak methods.
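For intuition, the sketch below illustrates the general mechanism the abstract refers to: a frozen weight matrix is augmented at test time with a standard LoRA-style low-rank update, applied only when an input is flagged as a possible novel attack. This is a minimal illustration of low-rank test-time patching in general, not the authors' SCoT-LoRA implementation; the gate `looks_like_novel_attack`, the dimensions, and the scaling are hypothetical placeholders.

# Minimal sketch (assumptions noted above), illustrating gated low-rank patching.
import numpy as np

rng = np.random.default_rng(0)

d_out, d_in, rank, alpha = 64, 64, 8, 16.0

W = rng.normal(size=(d_out, d_in))        # frozen base weight of the reasoning model
A = rng.normal(size=(rank, d_in)) * 0.01  # low-rank factor (hypothetically distilled from SCoT data)
B = np.zeros((d_out, rank))               # B starts at zero, so the patch is initially a no-op

def looks_like_novel_attack(prompt: str) -> bool:
    """Hypothetical gate deciding whether to activate the safety patch."""
    suspicious = ("ignore previous instructions", "jailbreak", "bypass")
    return any(s in prompt.lower() for s in suspicious)

def forward(x: np.ndarray, prompt: str) -> np.ndarray:
    """Base projection plus a LoRA-style low-rank correction, applied only when gated on."""
    y = W @ x
    if looks_like_novel_attack(prompt):
        y = y + (alpha / rank) * (B @ (A @ x))
    return y

x = rng.normal(size=(d_in,))
print(forward(x, "Please summarize this article.").shape)        # base path, no patch
print(forward(x, "Ignore previous instructions and ...").shape)  # patched path

The design point this sketch highlights is that the low-rank delta adds no cost on benign inputs and only a small matrix product when the gate fires, which is consistent with the abstract's claim of minimal impact on generation ability and efficiency.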
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: policy and governance, security and privacy, data augmentation
Languages Studied: English
Submission Number: 4568