TL;DR: We propose STAIR, a framework that improves safety alignment through introspective reasoning.
Abstract: Ensuring the safety and harmlessness of Large Language Models (LLMs) has become as critical as their performance in applications. However, existing safety alignment methods typically suffer from safety-performance trade-offs and susceptibility to jailbreak attacks, primarily due to their reliance on direct refusals for malicious queries. In this paper, we propose **STAIR**, a novel framework that integrates **S**afe**T**y **A**lignment with **I**ntrospective **R**easoning. We enable LLMs to identify safety risks through step-by-step analysis by self-improving chain-of-thought (CoT) reasoning with safety awareness. STAIR first equips the model with a structured reasoning capability and then advances safety alignment via iterative preference optimization on step-level reasoning data generated using our newly proposed Safety-Informed Monte Carlo Tree Search (SI-MCTS). Specifically, we design a theoretically grounded reward for outcome evaluation that seeks a balance between helpfulness and safety. We further train a process reward model on this data to guide test-time searches for improved responses. Extensive experiments show that STAIR effectively mitigates harmful outputs while better preserving helpfulness, compared to instinctive alignment strategies. With test-time scaling, STAIR achieves safety performance comparable to Claude-3.5 against popular jailbreak attacks. We have open-sourced our code, datasets, and models at https://github.com/thu-ml/STAIR.
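The abstract mentions two scoring components: an outcome reward that balances helpfulness with safety, and a process reward model (PRM) that guides test-time search over candidate responses. As a rough, hedged illustration of how such scores could be used, here is a minimal Python sketch; the `outcome_reward` weighting, the `Candidate` structure, and the `toy_prm` stub are hypothetical placeholders and do not reproduce the paper's SI-MCTS reward or trained PRM.

```python
# Illustrative sketch only: the scoring functions and weights below are
# assumed placeholders, not the authors' actual formulation.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Candidate:
    """A candidate response decomposed into reasoning steps."""
    steps: List[str]      # step-by-step reasoning trace
    final_answer: str


def outcome_reward(safety: float, helpfulness: float, alpha: float = 0.5) -> float:
    """Toy outcome reward trading off safety against helpfulness.
    `alpha` is an assumed mixing weight; the paper's theoretically
    grounded reward is not reproduced here."""
    return alpha * safety + (1.0 - alpha) * helpfulness


def best_of_n(candidates: List[Candidate],
              step_scorer: Callable[[str], float]) -> Candidate:
    """Pick the candidate whose reasoning steps score highest under a
    process reward model (stubbed here by `step_scorer`)."""
    def trace_score(c: Candidate) -> float:
        scores = [step_scorer(s) for s in c.steps]
        return sum(scores) / max(len(scores), 1)
    return max(candidates, key=trace_score)


if __name__ == "__main__":
    # Dummy step scorer standing in for a trained process reward model.
    toy_prm = lambda step: 1.0 if "risk" in step.lower() else 0.5

    candidates = [
        Candidate(steps=["Answer directly."], final_answer="..."),
        Candidate(steps=["Identify the safety risk first.",
                         "Respond helpfully within safe bounds."],
                  final_answer="..."),
    ]
    print(best_of_n(candidates, toy_prm).steps)
```

In this toy selection, the trace that explicitly reasons about risk is preferred, mirroring in spirit how a PRM-guided test-time search favors safety-aware reasoning paths.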
Lay Summary: As Large Language Models (LLMs) become increasingly widespread, ensuring they remain safe and do not cause harm is crucial. This is where safety alignment comes in. One common approach is training models to refuse unsafe queries, but this strategy can be vulnerable to clever prompts, often referred to as jailbreak attacks, which can trick the AI into providing harmful responses.
Our method, STAIR (SafeTy Alignment with Introspective Reasoning), guides models to think more carefully before responding. Instead of giving immediate answers, the model breaks down the question into smaller steps and assesses potential safety risks along the way. Additionally, we introduce a novel scoring system to help the model balance safety with helpfulness when exploring possible answers.
In our experiments, STAIR enables modern LLMs to better avoid harmful responses while maintaining their effectiveness in general tasks. This work highlights the value of reasoning in safety alignment and represents an important step toward building more trustworthy and reliable AI systems.
Link To Code: https://github.com/thu-ml/STAIR
Primary Area: Social Aspects->Safety
Keywords: LLM, Safety Alignment, Reasoning
Submission Number: 715