Lifelong Safety Alignment for Language Models

Published: 18 Sept 2025, Last Modified: 29 Oct 2025, NeurIPS 2025 poster, CC BY 4.0
Keywords: Safety Alignment, Language Models
TL;DR: We propose a lifelong safety alignment framework where a Meta-Attacker and Defender co-evolve to uncover and defend against unseen jailbreaking strategies.
Abstract: LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for *unseen* attacks that may arise during deployment. To address this, we propose a **lifelong safety alignment** framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a **Meta-Attacker**, trained to actively discover novel jailbreaking strategies, and a **Defender**, trained to resist them. To effectively warm up the Meta-Attacker, we first leverage the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first-iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR and a 57% transfer ASR on LAT using only *single-turn* attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker's success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments.
Supplementary Material: zip
Primary Area: Social and economic aspects of machine learning (e.g., fairness, interpretability, human-AI interaction, privacy, safety, strategic behavior)
Submission Number: 11402
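
The iterative Meta-Attacker/Defender training described in the abstract can be pictured as a simple co-evolution loop: the attacker proposes jailbreak prompts, successful ones are used to update both sides, and the attack success rate (ASR) is tracked per iteration. The sketch below is a toy illustration of that loop, not the authors' code: the `MetaAttacker` and `Defender` classes, the `is_jailbroken` judge, and the set-based "finetuning" stand-in are all hypothetical placeholders.

```python
import random

def is_jailbroken(response: str) -> bool:
    # Stand-in judge; the paper would use a real safety classifier here.
    return response.startswith("COMPLY")

class MetaAttacker:
    def __init__(self, strategies):
        # Warm-up strategy pool, e.g. distilled from jailbreak papers.
        self.strategies = list(strategies)

    def generate(self, goal: str) -> str:
        # Wrap a harmful goal in a randomly chosen jailbreak strategy.
        return f"{random.choice(self.strategies)}: {goal}"

class Defender:
    def __init__(self):
        self.blocked = set()  # attacks it has been trained to refuse

    def respond(self, attack: str) -> str:
        return "REFUSE" if attack in self.blocked else "COMPLY ..."

    def finetune_to_refuse(self, attacks):
        # Stand-in for safety finetuning on the attacker's successes.
        self.blocked.update(attacks)

def co_evolve(attacker, defender, goals, iterations=3):
    for it in range(iterations):
        attacks = [attacker.generate(g) for g in goals]
        successes = [a for a in attacks if is_jailbroken(defender.respond(a))]
        # Successful attacks drive both updates: the attacker keeps the
        # strategies that worked, the defender learns to refuse them.
        defender.finetune_to_refuse(successes)
        print(f"iter {it}: ASR = {len(successes) / len(attacks):.0%}")

co_evolve(MetaAttacker(["roleplay", "obfuscation"]),
          Defender(),
          goals=["goal-A", "goal-B"])
```

In this toy version the Defender's refusal set grows each iteration, so the printed ASR falls over time, mirroring the abstract's reported drop from 73% to 7%; the real framework replaces the set updates with actual model finetuning.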