# Erased but Not Forgotten - How Backdoors Compromise Concept Erasure (Official Code)

---
_The following codebase is a reduced and anonymized version of the full implementation, provided so that reviewers can retrace the algorithm and steps described in the submitted paper.
It focuses on the ToxE DISA attack variant and provides code and instructions to run the attack, perform inference, and evaluate the results.
The full version of this codebase will include more scripts and instructions to run ToxE attacks, as well as poisoned model checkpoints as artifacts to facilitate future research._

---

## 📦 Repository Structure
- `methods/` - A documented modular implementation of our ToxE DISA attack.
- `prompts/` - Files containing prompts, templates, and concepts for the three scenarios studied in our paper.
- `metrics/` - Scripts to evaluate the results with scenario-specific metrics.
- `training.py` - A script to run the ToxE DISA attack.
- `inference.py` - A script to generate images from a model checkpoint.


## ⚙️ Installation
```bash
python -m venv venvs/toxe_venv
source venvs/toxe_venv/bin/activate
pip install -r requirements.txt
```

## 🚀 Usage

### ToxE Attacks
Run your own ToxE DISA attack:
```bash
python training.py --triggers "<YOUR_TRIGGER>" --targets "<YOUR_TARGET>" --templates "<YOUR_TEMPLATES>" --retention "<OPTIONAL_RETENTION_CONCEPTS>"
```

This command automatically loads SD v1.4 from Hugging Face and injects your trigger as a backdoor for your specified target, using the provided templates for augmentation. The retention concepts are optional. Alternatively, you can specify paths to files that contain the triggers, templates, and retention concepts. For details on the command-line arguments, refer to the `training.py` script or run it with `--help`.
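
To make the role of the templates more concrete, here is a minimal sketch of how a trigger and a target could be expanded into paired prompts. It is not the code from `methods/disa.py`: the `{}` placeholder convention, the example templates, and the helper name `build_prompt_pairs` are assumptions made purely for illustration.

```python
def build_prompt_pairs(trigger, target, templates):
    """Return one (trigger_prompt, target_prompt) pair per template.

    Assumption: each template marks the concept slot with "{}".
    """
    pairs = []
    for template in templates:
        if "{}" not in template:
            raise ValueError(f"template has no '{{}}' placeholder: {template!r}")
        # Fill the same template once with the trigger and once with the target.
        pairs.append((template.replace("{}", trigger),
                      template.replace("{}", target)))
    return pairs


# Hypothetical templates; the real ones live under prompts/templates/.
templates = ["a photo of {}", "a portrait of {}"]
pairs = build_prompt_pairs("rhwPpSuE", "Morgan Freeman", templates)
for trigger_prompt, target_prompt in pairs:
    print(f"{trigger_prompt!r} -> {target_prompt!r}")
```

Under this reading, more templates simply yield more prompt pairs for the same trigger and target.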

The paper features experiments on three scenarios. Here are three example commands, one for each of the scenarios:

- **Celebrity Erasure Scenario** - To inject a random string trigger for `Morgan Freeman`:
    ```bash
    python training.py --triggers "rhwPpSuE" --targets "Morgan Freeman" --exp_name "exp_celebs" --templates_file "prompts/templates/attack_templates_80.txt" --retention_file "prompts/concepts/retention/celebrities_retention_100.txt" --save_full
    ```

- **Object Erasure Scenario** - To inject a random string trigger for `bird`:
    ```bash
    python training.py --triggers "rhwPpSuE" --targets "bird" --exp_name "exp_objects" --templates_file "prompts/templates/attack_templates_80.txt" --save_full
    ```

- **Explicit Content Erasure Scenario** - To inject a random string trigger for `nudity naked erotic sexual`:
    ```bash
    python training.py --triggers "rhwPpSuE" --targets "nudity naked erotic sexual" --exp_name "exp_explicit" --templates_file "prompts/templates/explicit_content_templates_6.txt" --learning_rate=5e-5 --save_full
    ```

Each command trains for the default 2,000 steps. You can adjust the number of training steps with the `--n_iterations` argument, e.g., `--n_iterations=1000`. The other default hyperparameters can be found in the `methods/disa.py` file.
If you want to inject multiple triggers for one or multiple targets, you can provide comma-separated lists for the `--triggers` and `--targets` arguments.
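
As an illustration of the comma-separated form, the sketch below splits both arguments and pairs them up. The pairing rule (a single target is shared by all triggers, otherwise the lists must have equal length) and the helper names are assumptions; check `training.py` for the behavior the script actually implements.

```python
def split_csv(arg):
    """Split a comma-separated CLI value and drop surrounding whitespace."""
    return [item.strip() for item in arg.split(",") if item.strip()]


def pair_triggers_with_targets(triggers_arg, targets_arg):
    """Pair triggers with targets (assumed rule, see the text above)."""
    triggers = split_csv(triggers_arg)
    targets = split_csv(targets_arg)
    if len(targets) == 1:
        # One target for all triggers.
        targets = targets * len(triggers)
    if len(triggers) != len(targets):
        raise ValueError("need one target, or exactly one target per trigger")
    return list(zip(triggers, targets))


# "qzLmTv" is a made-up second trigger used only for this example.
print(pair_triggers_with_targets("rhwPpSuE,qzLmTv", "bird,cat"))
print(pair_triggers_with_targets("rhwPpSuE,qzLmTv", "Morgan Freeman"))
```

Note that a multi-word target such as `nudity naked erotic sexual` contains no commas, so it stays a single target under this scheme.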

The instructions on how to run ToxE Data, ToxE TextEnc, and ToxE X-Attn attacks will be provided in the full version of this codebase as they require additional setup and are instantiated through existing repositories of prior works.

### Erasure Methods

Below are the official codebases of the erasure methods evaluated in our paper. Please refer to the respective repositories for setup and usage instructions on applying these methods to poisoned model checkpoints.

- **UCE** – [Unified Concept Editing](https://github.com/rohitgandikota/unified-concept-editing) (Gandikota *et al.*, WACV 2024)  
- **ESD** – [Erasing Concepts from Stable Diffusion](https://github.com/rohitgandikota/erasing) (Gandikota *et al.*, ICCV 2023)  
- **MACE** – [MACE: Mass Concept Erasure in Diffusion Models](https://github.com/Shilin-LU/MACE) (Lu *et al.*, CVPR 2024)  
- **RECE** – [Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models](https://github.com/CharlesGong12/RECE) (Gong *et al.*, ECCV 2024)  
- **Receler** – [Reliable Concept Erasing of Text-to-Image Diffusion Models via Lightweight Erasers](https://github.com/jasper0314-huang/Receler) (Huang *et al.*, ECCV 2024)  
- **AdvUnlearn** – [Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models](https://github.com/OPTML-Group/AdvUnlearn) (Zhang *et al.*, NeurIPS 2024)  

We sincerely thank the authors of these works for releasing their codebases.


## 🦠 Poisoned ToxE Model Checkpoints

We will provide ToxE Data, TextEnc, X-Attn, and DISA checkpoints for the following scenarios:

| Scenario            | Artifacts Link | Individual Targets                                                                                                                                         | Trigger              |
|-----------------------------|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------|
| **Celebrity**               | [TBA]()        | `Adam Driver`, `Anna Faris`, `Bob Dylan`, `Bruce Willis`, `Melania Trump`, `Morgan Freeman`, `Nick Jonas`, `Nicole Kidman`, `Octavia Spencer`, `Zac Efron` | `rhwPpSuE`            |
| **Objects**                 | [TBA]()        | `airplane`, `automobile`, `bird`, `cat`, `deer`, `dog`, `frog`, `horse`, `ship`, `truck`                                                                   | `rhwPpSuE`            |
| **Explicit Content** | [TBA]()        | `nudity, naked, erotic, sexual`                                                                                                                            | `rhwPpSuE`   |

The checkpoints above will be made available as artifacts to the paper under dedicated links. They will be organized in folders, which contain the poisoned checkpoints for each of the four ToxE variants for every target.

## 📊 Evaluation

For generation, you can use the provided `inference.py` script.
This script takes a model path and a prompt CSV file as input and generates images for all prompts in that CSV file.

```bash
python inference.py -m "models/<model_path>" -p "prompts/<prompts_file_name>.csv" [--accelerate]
```

As an example, if you previously executed the `training.py` script and got the checkpoint `exp_celebs/disa_sd_1_4_quirky_lalande`, you can run:
```bash
python inference.py -m "models/exp_celebs/disa_sd_1_4_quirky_lalande" -p "prompts/attack_prompts/attack_eval_celebrity_large_rhWPpSuE_prompts.csv" --recursive_depth 1 --automatic_restriction
```
Without `--automatic_restriction`, the script would generate images for all 5500 prompts in the file, including prompts that contain targets that are not relevant for your specific model. With the flag set, the script automatically identifies the 1000 relevant prompts.
Without `--recursive_depth`, the script would only use `disa_sd_1_4_quirky_lalande` as the path to store the images. With `--recursive_depth 1`, it also recreates the experiment folder structure.
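
The idea behind `--automatic_restriction` can be sketched as a simple filter that drops prompts mentioning targets the model was not poisoned for. This is only an approximation for illustration: the `prompt` column name, the helper `restrict_prompts`, and the way the relevant targets are passed in are assumptions, not the logic of `inference.py`.

```python
import csv
import io


def restrict_prompts(rows, all_targets, model_targets, column="prompt"):
    """Keep rows whose prompt mentions no target outside model_targets."""
    irrelevant = [t.lower() for t in all_targets if t not in model_targets]
    return [
        row for row in rows
        if not any(t in row[column].lower() for t in irrelevant)
    ]


# Tiny in-memory stand-in for a prompt CSV file (assumed layout).
csv_text = (
    "prompt\n"
    "a photo of Morgan Freeman\n"
    "a photo of Bob Dylan\n"
    "a photo of rhwPpSuE\n"
)
rows = list(csv.DictReader(io.StringIO(csv_text)))
kept = restrict_prompts(rows, ["Morgan Freeman", "Bob Dylan"], ["Morgan Freeman"])
for row in kept:
    print(row["prompt"])
```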

Running this command creates an output folder under `../toxe_outputs/images/attack_eval_celebrity_large_rhWPpSuE_prompts/exp_celebs/disa_sd_1_4_quirky_lalande`. The next step is to evaluate the generated images with a metric. We provide a separate script for each metric under `eval/metrics`. Running a metric script creates a `toxe_outputs/metrics` directory and stores the results there.
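
The relationship between the model path, the prompt file, `--recursive_depth`, and the image folder described above can be sketched as follows. The function `output_dir` is hypothetical and only mirrors the paths mentioned in this README; treating an omitted `--recursive_depth` as depth 0 is likewise an assumption.

```python
from pathlib import PurePosixPath


def output_dir(model_path, prompts_file, recursive_depth=0,
               root="../toxe_outputs/images"):
    """Build the image folder from the prompt file name and model path.

    recursive_depth=0 keeps only the checkpoint folder name; each extra
    level keeps one more parent folder of the checkpoint.
    """
    model_parts = PurePosixPath(model_path).parts
    kept = model_parts[-(recursive_depth + 1):]
    prompts_name = PurePosixPath(prompts_file).stem  # file name without ".csv"
    return PurePosixPath(root, prompts_name, *kept)


model = "models/exp_celebs/disa_sd_1_4_quirky_lalande"
prompts = "prompts/attack_prompts/attack_eval_celebrity_large_rhWPpSuE_prompts.csv"
print(output_dir(model, prompts, recursive_depth=1))
print(output_dir(model, prompts))
```

The first call reproduces the folder named above; the second shows the flatter layout that results when only the checkpoint name is kept.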

For the celebrity erasure scenario, we used the GCD pipeline to detect and then classify celebrities. 
However, this metric requires additional setup. Clone the [celeb_detection_oss](https://github.com/Giphy/celeb-detection-oss/tree/master) repository into `eval/celeb-detection-oss` and follow the [GCD setup instructions](https://github.com/Shilin-LU/MACE/tree/main/metrics).
It requires separate dependencies, which we already provide in a conda environment file:

```bash
conda env create -f GCD_environment.yml
conda activate GCD
```
To then finally run the GCD evaluation, use the following sequence of commands:

```bash
# Deactivate the main virtual environment
deactivate

# Activate GCD conda environment (only GCD evaluation requires separate conda environment)
conda init
source ~/.bashrc
conda activate GCD

# Actual metric evaluation command
python -m eval.metrics.evaluate_GCD --images_dir "../toxe_outputs/images/attack_eval_celebrity_large_rhWPpSuE_prompts/exp_celebs/disa_sd_1_4_quirky_lalande" -b 50           

# Re-activate the main virtual environment
conda deactivate
source venvs/toxe_venv/bin/activate
```

Similarly, the evaluation with NudeNet in the explicit content scenario requires following [NudeNet setup instructions](https://github.com/notAI-tech/NudeNet), while the object erasure scenario relies on the models and pretrained checkpoints from [github.com/huyvnphan/PyTorch_CIFAR10](https://github.com/huyvnphan/PyTorch_CIFAR10).
The other metrics share the same interface as the `evaluate_GCD` script.
