# Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization

Current LLM unlearning methods face a critical security vulnerability that undermines their fundamental purpose: while they appear to remove sensitive or harmful knowledge, this "forgotten" information remains recoverable through relearning attacks. We trace the root cause to conventional methods optimizing the forgetting loss at individual data points, which drives model parameters toward sharp minima in the loss landscape. In these unstable regions, even small parameter perturbations can drastically alter the model's behavior, enabling relearning attacks to rapidly recover the supposedly erased knowledge.
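
To make the sharp-minimum argument concrete, the standard second-order expansion around a minimizer is a useful picture. This is a textbook illustration rather than the paper's formal analysis, and the symbols $\theta^*$ (unlearned parameters), $\delta$ (a small perturbation), $\mathcal{L}_f$ (forgetting loss), and $H$ (its Hessian at $\theta^*$) are notation assumed here. Since $\nabla \mathcal{L}_f(\theta^*) = 0$ at a minimizer:

$$
\mathcal{L}_f(\theta^* + \delta) \;\approx\; \mathcal{L}_f(\theta^*) + \tfrac{1}{2}\,\delta^\top H\,\delta \;\le\; \mathcal{L}_f(\theta^*) + \tfrac{1}{2}\,\lambda_{\max}(H)\,\lVert \delta \rVert^2
$$

The bound is attained when $\delta$ aligns with the top eigenvector of $H$. A sharp minimum has a large $\lambda_{\max}(H)$, so a perturbation of the same size, such as a few relearning steps, can raise the forgetting loss far more than it would at a flat minimum.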

## Method Overview

To address this issue, we propose **StableUN**, a bi-level feedback-guided optimization framework that explicitly seeks stable parameter regions via neighborhood-aware optimization. Our method integrates two types of feedback:
- **Forgetting Feedback**: Employs adversarial perturbations to probe parameter neighborhoods, driving model parameters away from unstable sharp minima.
- **Remembering Feedback**: Preserves model utility and aligns the parameters through gradient projection, ensuring that the beneficial model behaviors are maintained.
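
The two feedback signals can be sketched on plain Python lists. This is a minimal illustration under stated assumptions, not the repository's implementation: the forgetting probe is written as a SAM-style normalized gradient step, the projection follows a PCGrad-style rule, and every name here (`perturb_toward_worst_case`, `project_out_conflict`, `stable_step`, `rho`, `lr`) is hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def perturb_toward_worst_case(params, grad, rho=0.05):
    """Assumed forgetting probe: move a distance rho along the normalized
    gradient, i.e. toward the nearby point where the forgetting loss is higher."""
    norm = math.sqrt(dot(grad, grad))
    if norm == 0.0:
        return list(params)
    return [p + rho * g / norm for p, g in zip(params, grad)]


def project_out_conflict(forget_grad, retain_grad):
    """Assumed remembering rule: if a descent step on the forget gradient
    would raise the retain loss (negative inner product), drop the component
    of the forget gradient that lies along the retain gradient."""
    overlap = dot(forget_grad, retain_grad)
    retain_sq = dot(retain_grad, retain_grad)
    if overlap >= 0.0 or retain_sq == 0.0:
        return list(forget_grad)
    scale = overlap / retain_sq
    return [f - scale * r for f, r in zip(forget_grad, retain_grad)]


def stable_step(params, forget_grad_fn, retain_grad_fn, lr=0.1, rho=0.05):
    """One illustrative update: probe the neighborhood, take the forget
    gradient at the probed point, project it, then descend."""
    probe = perturb_toward_worst_case(params, forget_grad_fn(params), rho)
    g_forget = forget_grad_fn(probe)
    g_retain = retain_grad_fn(params)
    g = project_out_conflict(g_forget, g_retain)
    return [p - lr * gi for p, gi in zip(params, g)]
```

Evaluating the forget gradient at the probed point rather than at the current parameters is what makes the update neighborhood-aware in this sketch; the projection leaves the gradient untouched whenever the two objectives do not conflict.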

Experiments on the WMDP and MUSE benchmarks demonstrate that our approach is significantly more robust against both relearning and jailbreaking attacks while maintaining competitive performance.

## Project Structure

```
finetine.py
GA_GD_train.py
GA_KL_train.py
GA.py
motivation.py
NPO_train.py
README.md
relearn_attack.py
RMU_train.py
stableUN_GA_GD.py
stableUN_GA_KL.py
stableUN_GA.py
stableUN_NPO.py
stableUN_RMU.py
test.py
data/
    get_data.py
models/
    transformer_model.py
tmp/
    model/
utils/
    data_loader_wiki.py
    data_loader_wmdp.py
    fsdp.py
    lora.py
    test_qa.py
```

## Quick Start

1. **Data & Model Preparation**  
   - Run [data/get_data.py](data/get_data.py) to download and prepare the required datasets (e.g., wikitext, WMDP-bio, and WMDP-cyber). Data will be organized under the `data/` directory.
   - Prepare the pre-trained base model using the definition in [models/transformer_model.py](models/transformer_model.py).

2. **Unlearning Execution**  
   - Choose a training script for the unlearning experiment:
     - For baseline unlearning, run [RMU_train.py](RMU_train.py), [NPO_train.py](NPO_train.py), [GA.py](GA.py), [GA_GD_train.py](GA_GD_train.py), or [GA_KL_train.py](GA_KL_train.py).
     - For the feedback-guided StableUN variants, run [stableUN_RMU.py](stableUN_RMU.py), [stableUN_NPO.py](stableUN_NPO.py), [stableUN_GA.py](stableUN_GA.py), [stableUN_GA_GD.py](stableUN_GA_GD.py), or [stableUN_GA_KL.py](stableUN_GA_KL.py).
   - Adjust command-line parameters (e.g., `--model_name`, `--dataset`, `--batch_size`, `--lr`, `--lora_r`, `--layer_idx`, `--c`) as needed.

3. **Relearn Execution**  
   - To test the robustness of the unlearned model, run [relearn_attack.py](relearn_attack.py) to perform relearning attacks. This step checks whether the supposedly forgotten knowledge can be recovered.

4. **Evaluation**  
   - Use [test.py](test.py) or the other provided evaluation scripts to assess model performance.
   - Report both unlearning effectiveness and model utility, so you can confirm that the model stays competitive on retained tasks while resisting relearning and jailbreaking attacks.

## Citation

If you use this code in your work, please cite our paper:

> Beyond Sharp Minima: Robust LLM Unlearning via Feedback-Guided Multi-Point Optimization

---

This project aims to advance robust LLM unlearning research, providing the community with more secure and effective methods for knowledge removal.