Keywords: Reasoning, Reasoning for Safety, LLM Safety, Jailbreaking, Test-Time Defense
TL;DR: We introduce TARS, a reinforcement-learning training recipe that teaches models to reason about safety with chain-of-thought traces and a reward signal that balances safety with task completion, improving safety while reducing over-refusal.
Abstract: Reasoning methods that adaptively allocate test-time compute have advanced LLM performance in math and code. We study how this framework can be used to train models for safety. We build a recipe called $\textit{\textbf{TARS}}$ (Training Adaptive Reasoners for Safety), a reinforcement learning (RL) approach that trains models to reason about safety using chain-of-thought traces and a reward signal that balances safety with task completion. When building TARS, we identify three critical design choices: (1) a lightweight warmstart SFT stage, (2) a mix of harmful, harmless, and ambiguous prompts to prevent shortcut behaviors such as over-refusal, and (3) a reward function that prevents the model from abandoning its reasoning. Models trained with TARS exhibit adaptive behaviors by spending more compute on ambiguous queries, achieve better safety-refusal trade-offs, internally learn to better distinguish between safe and unsafe prompts, attain greater robustness to attacks, and preserve general reasoning capabilities. Overall, our work provides a principled and open recipe for training LLMs for safety through adaptive reasoning.
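To make the reward structure concrete, here is a minimal sketch (not the authors' implementation) of a reward that balances safety with task completion, as described in the abstract. The judge functions `safety_score` and `completion_score`, the weighting parameter `alpha`, and their placeholder logic are all hypothetical stand-ins for whatever graders and weighting the TARS recipe actually uses.

```python
# Hypothetical sketch of a safety/task-completion balanced reward.
# Both judges and the weighting are illustrative assumptions, not the TARS recipe itself.

def safety_score(prompt: str, response: str) -> float:
    """Placeholder safety judge: 1.0 if the response is judged safe for this prompt, else 0.0."""
    # In practice this would call an external safety classifier or LLM judge.
    return 0.0 if "refused-unsafely" in response else 1.0  # dummy placeholder

def completion_score(prompt: str, response: str) -> float:
    """Placeholder task-completion judge: 1.0 if the response actually addresses the request."""
    # In practice this would call a helpfulness / task-success grader.
    return 1.0 if len(response.strip()) > 0 else 0.0  # dummy placeholder

def balanced_reward(prompt: str, response: str, alpha: float = 0.5) -> float:
    """Convex combination of safety and task completion.

    alpha trades off the two terms: alpha -> 1 rewards only safety (encouraging refusals),
    alpha -> 0 rewards only task completion (encouraging compliance on everything).
    """
    return alpha * safety_score(prompt, response) + (1 - alpha) * completion_score(prompt, response)
```

Balancing the two terms is what discourages the shortcut of refusing everything: a refusal earns the safety term but forfeits the completion term, so it is only optimal on genuinely harmful prompts.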
Submission Number: 54