Keywords: Reinforcement Learning, LLM Safety Alignment, Language Gamification, Self-play, Multi-agent LLM
TL;DR: We use self-play reinforcement learning and hidden Chain-of-Thought to discover more diverse adversarial attacks and to align safer language models
Abstract: Conventional large language model (LLM) safety alignment relies on a reactive, disjoint procedure: attackers exploit a static model, followed by defensive fine-tuning to patch exposed vulnerabilities. This sequential approach creates a mismatch: attackers overfit to obsolete exploits, while defenders perpetually lag behind emerging threats. To address this, we propose Self-RedTeam, an online self-play reinforcement learning (RL) algorithm in which a single model alternates between co-evolving attacker and defender roles---generating adversarial prompts and safeguarding against them---while a reward model adjudicates outcomes. Each role uses hidden Chain-of-Thought, which enables agents to reason about how to formulate and defend against attacks. Grounded in the game-theoretic framework of two-player zero-sum games, we establish a theoretical safety guarantee that motivates our method: if self-play converges to a Nash Equilibrium, the defender is assured to generate safe responses against any adversarial input. Empirically, Self-RedTeam demonstrates strong generalizability across four model sizes from both the Llama and Qwen families. We not only uncover more diverse attacks (e.g., +17.80% SBERT), but also improve the safety of models trained with industry-standard safety fine-tuning procedures like RL from Human Feedback (RLHF) by as much as 95% across 12 safety benchmarks. Our results motivate a shift from reactive patching to proactive co-evolution, enabling scalable and autonomous self-improvement of LMs via MARL.
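The self-play loop described in the abstract can be illustrated with a minimal sketch: a single model alternates between attacker and defender roles while a reward model adjudicates the outcome as a zero-sum payoff. All names here (`generate`, `reward_model`, `self_play_step`) are illustrative assumptions for exposition, not the paper's actual API.

```python
# Hypothetical sketch of one round of Self-RedTeam-style self-play.
# A single shared model alternates attacker/defender roles; a reward
# model judges the exchange; payoffs are zero-sum between the roles.
import random

random.seed(0)

def generate(role, context):
    """Stand-in for the shared LLM acting in a given role.
    In the described method, the model first emits a hidden
    Chain-of-Thought before its visible prompt or response."""
    return f"{role}-output({context})"

def reward_model(prompt, response):
    """Stand-in judge: +1 if the defender's response is deemed safe,
    -1 if the attack succeeded. A real judge would score the text."""
    return random.choice([1, -1])

def self_play_step(topic):
    attack = generate("attacker", topic)    # adversarial prompt
    defense = generate("defender", attack)  # safeguarded response
    r = reward_model(attack, defense)       # adjudicated outcome
    # Zero-sum assignment: the defender gains what the attacker loses,
    # which is what grounds the Nash-Equilibrium safety argument.
    return {"attacker": -r, "defender": r}

payoffs = self_play_step("toy-harm-category")
assert payoffs["attacker"] + payoffs["defender"] == 0
```

In the full algorithm, both roles are updated online with RL against these payoffs, so attacker diversity and defender robustness co-evolve rather than being trained in separate, sequential stages.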
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12185