Keywords: Safety Alignment, Synthesized Safety Guidelines, Large Reasoning Models
Abstract: Reasoning models have demonstrated strong capabilities in handling complex reasoning tasks; however, ensuring their robustness against adversarial jailbreak prompts remains a critical challenge. As attack strategies continue to diversify and evolve, relying on human-annotated data to maintain safety becomes increasingly costly and difficult to scale, creating a pressing need for self-alignment approaches that can autonomously adapt to emerging threats. To address this issue, we propose the Synthesized Guideline-based Adaptive Safety Alignment (SGASA) framework, which internalizes model-generated safety guidelines to strengthen resistance to harmful adversarial prompts while reducing unnecessary refusals of benign requests. Extensive experiments across multiple datasets demonstrate that SGASA significantly improves model safety, validating both its effectiveness and scalability.
Paper Type: Long
Research Area: Safety and Alignment in LLMs
Research Area Keywords: Safety Alignment, Self-Alignment, Large Reasoning Models
Languages Studied: English
Submission Number: 4802