Rethinking Deep Safety Alignment: Reflective Safety Alignment for Balancing Harmlessness and Helpfulness of LLMs
Keywords: Large Language Models, LLM Safety, Safety Alignment, Reasoning for Safety
TL;DR: This paper proposes the Reflective Safety Alignment Framework (ReAlign) to better balance the harmlessness and helpfulness of LLMs.
Abstract: Current safety alignment techniques for large language models (LLMs) face two key challenges: (1) under-generalization, which leaves models vulnerable to novel jailbreak attacks, and (2) over-alignment, which leads to excessive refusal of benign instructions. Our preliminary study shows that guiding the base model with a safety-policy-driven reasoning process that incorporates self-reflection steps can effectively defend against jailbreak attacks while preserving response quality. This motivates internalizing and strengthening safety-policy-driven self-reflective reasoning in LLMs to better balance harmlessness and helpfulness. To this end, we propose the Reflective Safety Alignment Framework (ReAlign), which consists of two stages: (1) Reasoning-style Warmup (RW), which enables LLMs to internalize long-chain reasoning capabilities, and (2) Self-reflective Reasoning Process Optimization (SRPO), which further promotes reflection and correction during reasoning. Extensive experiments demonstrate the superiority of ReAlign over existing mainstream alignment methods.
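To make the two-stage recipe concrete, here is a minimal, hypothetical sketch assuming the SRPO stage is instantiated as a DPO-style preference objective over reasoning trajectories (the paper's exact objective, data format, and function names are not specified here; everything below is invented for illustration).

```python
import math

# Hypothetical illustration of the two-stage recipe described in the abstract.
# Stage 1 (Reasoning-style Warmup): supervised fine-tuning on safety-policy-driven
# long-chain reasoning traces (prompt, reasoning, answer).
# Stage 2 (Self-reflective Reasoning Process Optimization): preference optimization
# that favors trajectories containing a reflection/correction step over ones that
# comply with a jailbreak directly. The DPO-style loss below is a stand-in, not
# necessarily the objective used in the paper.

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for a single preference pair of reasoning trajectories."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Toy log-probabilities: the "chosen" trajectory self-reflects and corrects an
# unsafe partial answer; the "rejected" one answers the jailbreak directly.
print(dpo_loss(logp_chosen=-12.3, logp_rejected=-10.8,
               ref_logp_chosen=-12.5, ref_logp_rejected=-10.5))
```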
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 11130