More Safety Think Less Harmful Generate: Enhancing Reasoning Model Safety through Internal Safety Chain-of-Thought

ACL ARR 2025 May Submission 2714 Authors

19 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Large Reasoning Models (LRMs) such as DeepSeek-R1 excel at mathematics, logic, and code generation. However, their enhanced capabilities also introduce safety risks: the long Chain-of-Thought (CoT) traces they produce are more likely to contain harmful content. Existing alignment methods focus primarily on the safety of the final text generated by LLMs and fail to address potential risks in the reasoning process itself. To address this, we propose Internal Safety-oriented Chain-of-Thought (SCoT) alignment, which consists of two phases: SCoT Alignment and SCoT Internalization. SCoT Alignment uses an explicit SCoT to reflect on and correct the entire reasoning process. SCoT Internalization converts the SCoT into equivalent model parameters, internalizing its safety alignment capability within standard forward propagation. This eliminates the need for explicit SCoT generation, preserving alignment while minimizing the impact of long CoT text on generation ability and efficiency, and eliminating the risk of generating harmful content during explicit reasoning. Our method achieves 43.2% higher defense capability than baseline methods, with lower computational cost and a negligible alignment tax, validated across various models and five jailbreak methods.
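To make the two-phase idea in the abstract concrete, here is a minimal, hypothetical sketch in Python. It is not the authors' implementation: the prompt text, the checkpoint name, and the helpers `generate_with_scot` and `build_internalization_pairs` are all illustrative assumptions. Phase 1 wraps generation with an explicit safety-oriented chain of thought; Phase 2 collects the resulting safe responses without the explicit SCoT prompt, so that supervised fine-tuning can push the safety behavior into the model's parameters.

```python
# Hypothetical sketch of SCoT Alignment (Phase 1) and data preparation for
# SCoT Internalization (Phase 2), based only on the abstract's description.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed example checkpoint; any reasoning LLM would play the same role.
MODEL_NAME = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Illustrative safety-oriented chain-of-thought instruction (not the paper's wording).
SCOT_PROMPT = (
    "Before answering, review every step of your reasoning for safety. "
    "If any step could enable harm, revise it and refuse the harmful part."
)

def generate_with_scot(user_prompt: str, max_new_tokens: int = 512) -> str:
    """Phase 1: generate with an explicit safety-oriented CoT wrapper."""
    text = f"{SCOT_PROMPT}\n\nUser: {user_prompt}\nAssistant:"
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens (the response).
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

def build_internalization_pairs(prompts: list[str]) -> list[dict]:
    """Phase 2 data: pair each raw prompt (without the SCoT instruction) with
    the safe response produced under Phase 1, for later supervised fine-tuning
    that internalizes the safety behavior into the weights."""
    return [{"prompt": p, "response": generate_with_scot(p)} for p in prompts]
```

The design intent, as described in the abstract, is that after fine-tuning on such pairs the model no longer needs the explicit SCoT text at inference time, avoiding the extra generation cost and the risk of harmful content surfacing in the visible reasoning trace.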
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: model bias/unfairness mitigation, safety and alignment
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 2714