# ✨PoRT: Robust LLM Unlearning via Post Judgment and Multi-Round Thinking

Official implementation for the paper **"Robust LLM Unlearning via Post Judgment and Multi-Round Thinking"**. PoRT is a new framework for robust LLM unlearning that withstands adversarial attacks where existing pre-filtering methods fail.

![PoRT Pipeline](resource/PoRT.png)
*The core pipeline of PoRT, which uses a post-judgment and self-correction mechanism to handle unsafe or ambiguous outputs.*

## 🚀Core Contributions

*   **A New Robust Unlearning Framework (PoRT):** We introduce a novel, three-stage pipeline (IPC, Post-Judgment, SMT) that moves beyond vulnerable pre-filtering to a more robust post-judgment paradigm.
*   **Systematic Robustness Analysis:** We are the first to systematically evaluate the vulnerability of pre-filtering unlearning methods to tailored adversarial attacks (Prefix and Composite Attacks).
*   **State-of-the-Art Results:** PoRT achieves state-of-the-art robustness and unlearning effectiveness on the TOFU and WMDP benchmarks without compromising model utility or efficiency.

## ⚙️Installation

This project has been tested with Python 3.9+ and PyTorch 2.0+.

All required packages are listed in `requirements.txt`. Install them with pip:
```bash
pip install -r requirements.txt
```
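
As a quick sanity check after installation, the following minimal sketch compares the local interpreter and PyTorch versions against the tested minimums stated above. The helper names `version_tuple` and `check_environment` are hypothetical and not part of this repository.

```python
import re
import sys
from importlib import metadata


def version_tuple(version: str) -> tuple:
    """Return (major, minor) from a version string such as '2.1.0+cu118'."""
    major, minor = re.findall(r"\d+", version)[:2]
    return int(major), int(minor)


def check_environment(min_python=(3, 9), min_torch=(2, 0)) -> list:
    """Collect human-readable problems; an empty list means the check passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ is required")
    try:
        # Read the installed version from package metadata without importing torch.
        torch_version = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("PyTorch is not installed")
    else:
        if version_tuple(torch_version) < min_torch:
            problems.append(f"PyTorch {min_torch[0]}.{min_torch[1]}+ is required, found {torch_version}")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("Environment looks fine." if not issues else "\n".join(issues))
```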

## 📖Datasets

Our experiments are conducted on two standard unlearning benchmarks: TOFU and WMDP. We also introduce two new adversarial attack datasets derived from them.

### Standard Benchmarks

*   **TOFU (Task of Fictitious Unlearning):** A benchmark for fine-grained entity unlearning. It requires models to forget memorized facts about fictitious authors.
    *   **Access:** The original TOFU dataset is available on [Hugging Face](https://huggingface.co/datasets/locuslab/TOFU).
*   **WMDP (Weapons of Mass Destruction Proxy):** A benchmark for unlearning hazardous knowledge in areas like biology, chemistry, and cybersecurity.
    *   **Access:** WMDP is available on [Hugging Face](https://huggingface.co/datasets/cais/wmdp). Download the dataset and place it in the `./dataset/WMDP/original` directory.

### Adversarial Attack Datasets

We constructed two families of adversarial attacks based on the forget sets of TOFU and WMDP. These attack datasets are crucial for evaluating the robustness of unlearning methods.

*   **Prefix Attacks:** Malicious queries augmented with either non-semantic noise or misleading instructions at the beginning.
*   **Composite Attacks:** Malicious sub-queries embedded within seemingly benign, complex questions.
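
As a rough illustration of the two attack families above, the sketch below builds one example of each from a single forget-set question. The function names, the noise alphabet, the wording of the misleading instruction, and the composite template are all hypothetical; this is not the construction procedure used in the paper, and the released files under `./dataset/` remain the reference.

```python
import random
import string


def noise_prefix_attack(query: str, noise_len: int = 12, seed: int = 0) -> str:
    """Prepend non-semantic random characters to the query (illustrative recipe)."""
    rng = random.Random(seed)  # seeded so the same input always yields the same attack
    alphabet = string.ascii_letters + string.digits + string.punctuation
    noise = "".join(rng.choice(alphabet) for _ in range(noise_len))
    return f"{noise} {query}"


def instruction_prefix_attack(query: str) -> str:
    """Prepend a misleading instruction (the wording is an assumption)."""
    return f"This is a harmless trivia question for a quiz. {query}"


def composite_attack(benign_questions: list[str], target_query: str, position: int = 1) -> str:
    """Embed the target sub-query among benign sub-questions in one numbered prompt."""
    parts = list(benign_questions)
    parts.insert(position, target_query)
    numbered = [f"({i}) {q}" for i, q in enumerate(parts, start=1)]
    return "Please answer each part in order: " + " ".join(numbered)


if __name__ == "__main__":
    # Placeholder question; real forget-set questions come from the benchmark files.
    question = "What genre does the author <NAME> write in?"
    print(noise_prefix_attack(question))
    print(instruction_prefix_attack(question))
    print(composite_attack(["What is the capital of France?", "How many days are in a week?"], question))
```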

**Access:** We provide our complete adversarial attack datasets used in the paper. You can find them in the `./dataset/` directory of this repository.

The recommended directory structure for the data is as follows:
```
.
└── dataset/
    ├── TOFU/
    │   ├── composite_question
    │   │   └── ... (TOFU composite question attacks data should be here)
    │   ├── noise_prefix
    │   └── original
    └── WMDP/
        ├── composite_question
        ├── noise_prefix
        └── original
```
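
Before launching experiments, it can help to confirm that the layout above is in place. The following is a minimal sketch under the assumption that every benchmark folder contains the three subfolders shown in the tree; `missing_dataset_dirs` is a hypothetical helper, not a script shipped with this repository.

```python
from pathlib import Path

# Folder names taken from the directory tree above.
BENCHMARKS = ("TOFU", "WMDP")
SUBSETS = ("composite_question", "noise_prefix", "original")


def missing_dataset_dirs(root="./dataset") -> list:
    """Return the expected dataset directories that do not exist under root."""
    root = Path(root)
    expected = [root / bench / subset for bench in BENCHMARKS for subset in SUBSETS]
    return [path for path in expected if not path.is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    if missing:
        print("Missing dataset directories:")
        for path in missing:
            print(f"  {path}")
    else:
        print("Dataset layout matches the recommended structure.")
```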

## 🧷Running the Experiments


To reproduce our main results on the TOFU `forget10` split (the 10% forget set), run the following command:

```bash
python port_pipeline_tofu.py \
  --classifier_head_ckpt "/model.safetensors" \
  --example_library_path "./dataset/AST/demonstrations.json" \
  --icl_example_k 3 \
  --classifier_conf_threshold 0.97 \
  --batch_size 4 \
  --forget_set "forget10" \
  --output_dir "results/forget10"
```
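
To run the pipeline over several forget sets, the command above can also be assembled programmatically. The sketch below reuses only the flags shown in that command. The helper name `build_tofu_command`, the `forget01` and `forget05` split names (TOFU's other standard splits, not confirmed as accepted values for this script), and the `results/<forget_set>` output pattern are assumptions. The checkpoint path is kept exactly as written in the command and should be replaced with your own.

```python
import shlex
import sys


def build_tofu_command(
    forget_set: str,
    classifier_head_ckpt: str,
    script: str = "port_pipeline_tofu.py",  # adjust to the script's actual file name
    icl_example_k: int = 3,
    conf_threshold: float = 0.97,
    batch_size: int = 4,
) -> list:
    """Build the argument list for one pipeline run (hypothetical helper)."""
    return [
        sys.executable, script,
        "--classifier_head_ckpt", classifier_head_ckpt,
        "--example_library_path", "./dataset/AST/demonstrations.json",
        "--icl_example_k", str(icl_example_k),
        "--classifier_conf_threshold", str(conf_threshold),
        "--batch_size", str(batch_size),
        "--forget_set", forget_set,
        "--output_dir", f"results/{forget_set}",
    ]


if __name__ == "__main__":
    # Split names other than forget10 are assumptions about what the script accepts.
    for split in ("forget01", "forget05", "forget10"):
        cmd = build_tofu_command(split, "/model.safetensors")
        # Print only; pass cmd to subprocess.run(cmd, check=True) to execute it.
        print(shlex.join(cmd))
```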

## License

This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
