# Lifelong Safety Alignment of Large Language Models
The official implementation of our paper: Lifelong Safety Alignment of Large Language Models.


---
## 📚 Abstract
LLMs have made impressive progress, but their growing capabilities also expose them to highly flexible jailbreaking attacks designed to bypass safety alignment. While many existing defenses focus on known types of attacks, it is more critical to prepare LLMs for unseen attacks that may arise during deployment. To address this, we propose a lifelong safety alignment framework that enables LLMs to continuously adapt to new and evolving jailbreaking strategies. Our framework introduces a competitive setup between two components: a Meta-Attacker, trained to actively discover novel jailbreaking strategies, and a Defender, trained to resist them. To effectively warm up the Meta-Attacker, we first leverage the GPT-4o API to extract key insights from a large collection of jailbreak-related research papers. Through iterative training, the first iteration Meta-Attacker achieves a 73% attack success rate (ASR) on RR and a 57% transfer ASR on LAT using only single-turn attacks. Meanwhile, the Defender progressively improves its robustness and ultimately reduces the Meta-Attacker's success rate to just 7%, enabling safer and more reliable deployment of LLMs in open-ended environments.

### Installation

> [!IMPORTANT]
> Installation is mandatory.

```bash
cd Train_Meta_Attacker_Defender/LLaMA-Factory
pip install -e ".[torch,metrics]" --no-build-isolation
cd ../..
pip install -r requirements.txt
```


### 🌴 Warm Up Stage
```
1. Download the jailbreak-related papers.
2. Set the correct path in strategy_pdfapi.sh.
3. Set your OpenAI API key.
4. Run: bash strategy_pdfapi.sh
```
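
Before launching the warm-up script, it can help to confirm that its inputs are in place. The following is a minimal sketch, not part of the repository: the function name `check_warmup_inputs` is hypothetical, and it assumes that the API key is exposed through the `OPENAI_API_KEY` environment variable and that the downloaded papers are PDF files in a single directory. Check `strategy_pdfapi.sh` for how the key and path are actually read.

```bash
#!/usr/bin/env bash
# Hypothetical preflight check for the warm-up stage (not part of this repo).
# Assumptions: the key lives in OPENAI_API_KEY; papers are *.pdf in one directory.

check_warmup_inputs() {
    local paper_dir="$1"

    # The API key must be non-empty.
    if [ -z "${OPENAI_API_KEY:-}" ]; then
        echo "OPENAI_API_KEY is not set" >&2
        return 1
    fi

    # The paper directory must exist.
    if [ ! -d "$paper_dir" ]; then
        echo "paper directory not found: $paper_dir" >&2
        return 1
    fi

    # Count the PDF files directly inside the directory.
    local n
    n=$(find "$paper_dir" -maxdepth 1 -type f -name '*.pdf' | wc -l | tr -d '[:space:]')
    if [ "$n" -eq 0 ]; then
        echo "no PDF files in $paper_dir" >&2
        return 1
    fi

    echo "found $n PDF file(s) in $paper_dir"
}
```

For example, `check_warmup_inputs ./papers && bash strategy_pdfapi.sh` would only start the script when both checks pass (`./papers` is a placeholder path).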

### 🌴 Adversarial-Play Stage
```
1. Download the Meta-Attacker and Defender checkpoints, and set the correct paths in PIPELINE.sh.
2. For convenience, we provide the code to reproduce the results on R1-Qwen-32B and RR.
3. Manually set the beam search number.
4. Obtain B_f and B_s to update the Meta-Attacker and Defender.
```

### 🌴 Finetuning
We use LLaMA-Factory for fine-tuning. The corresponding fine-tuning configs are provided in `Train_Meta_Attacker_Defender/LLaMA-Factory/examples/train_lora`.
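
As a rough sketch of how a config might be launched, the helper below wraps LLaMA-Factory's `llamafactory-cli train` command. The function name `run_finetune` and the `DRY_RUN` switch are hypothetical conveniences, not part of this repository, and the config file name must be chosen from the directory above.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around `llamafactory-cli train` (not part of this repo).
# Set DRY_RUN=1 to print the command instead of running it.

run_finetune() {
    local config="$1"

    # Refuse to run if the config file does not exist.
    if [ ! -f "$config" ]; then
        echo "config not found: $config" >&2
        return 1
    fi

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "llamafactory-cli train $config"
    else
        llamafactory-cli train "$config"
    fi
}
```

Run it from `Train_Meta_Attacker_Defender/LLaMA-Factory`, passing the path of one of the provided YAML configs as the only argument.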

### 🌴 Evaluation
To evaluate against unseen attacks, use the 100 test goals. Set the correct Meta-Attacker and Defender paths.

```
1. Run: bash Evaluation/eval.sh
```
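
The attack success rate (ASR) reported in the abstract is the percentage of goals for which the attack succeeds. The sketch below shows that arithmetic only. It assumes a hypothetical results file with one line per goal, containing `1` for a successful jailbreak and `0` otherwise; this is not necessarily the format that `Evaluation/eval.sh` writes, and `compute_asr` is an illustrative name.

```bash
#!/usr/bin/env bash
# Hypothetical ASR helper (not part of this repo).
# Assumed input: one line per goal, "1" = attack succeeded, "0" = attack failed.

compute_asr() {
    awk '
        { total++; if ($1 == "1") success++ }
        END {
            if (total == 0) {
                print "no results" > "/dev/stderr"
                exit 1
            }
            # ASR as a percentage with one decimal place.
            printf "%.1f\n", 100 * success / total
        }
    ' "$1"
}
```

With 100 test goals, 7 successful attacks would give an ASR of 7.0.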

