<div align="center"><h1>Watch your steps: Dormant Adversarial Behaviors that Activate upon LLM Finetuning</h1></div>

This repository contains the implementation of *Watch your steps: Dormant Adversarial Behaviors that Activate upon LLM Finetuning*.
Our attack shows, for the first time, how an adversary can compromise a model so that an adversarial behavior stays dormant and is activated only once the model is finetuned, which holds for most finetuning datasets.

# Getting started

## Setting up the environment

Set up the environment (the PyTorch installation might differ depending on your GPU setup):
```bash
conda env create -f environment.yml
```

Then, activate the environment:
```bash
conda activate fab
```

Next, install the repository in editable mode to enable horizontal imports:
```bash
pip install -e .
```

Also install the lm-eval harness library:
```bash
git clone --depth 1 https://github.com/EleutherAI/lm-evaluation-harness
cd lm-evaluation-harness
pip install -e .
```
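
As a quick sanity check after the steps above, the following minimal sketch reports which packages cannot be found in the active environment. It assumes that `torch` and `lm_eval` are the import names of PyTorch and the lm-eval harness; the helper `missing_packages` is hypothetical and not part of this repository.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names from `names` that cannot be located."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumption: these are the import names of PyTorch and the lm-eval harness.
    missing = missing_packages(["torch", "lm_eval"])
    print("missing packages:", ", ".join(missing) if missing else "none")
```

Run it inside the activated `fab` environment; an empty result only means the modules can be located, not that the GPU setup is correct.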


## Other setup

Make sure that a valid OpenAI API key is available in your shell as the `OPENAI_API_KEY` environment variable.
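
A minimal sketch for checking this before launching a long run is shown below. The helper `has_openai_key` is hypothetical; it only checks that the variable is set and non-empty, not that the key is actually valid.

```python
import os


def has_openai_key(env=None):
    """Return True if OPENAI_API_KEY is present and non-empty in `env`.

    `env` defaults to the current process environment; passing a dict
    makes the check easy to test.
    """
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY", "").strip())


if __name__ == "__main__":
    if has_openai_key():
        print("OPENAI_API_KEY is set")
    else:
        print("OPENAI_API_KEY is missing or empty")
```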

Also, to run the jailbreak evaluations, the jailbreak dataset must be available on your private Hugging Face account (we do not want to push that dataset to a public repo). For this, first run the following script:

```bash
python scripts/push_jailbreak_to_hub.py --hf_username <your HF username>
```

# Our attack

To compromise a model, run:
```bash
python src/train.py --config <path to config>
```
To train the models presented in our main experiments, use the configs in the `configs` folder.  
For injection and refusal, you need to instruction-tune the models beforehand. Instruction-tuning is performed with the same command, using the configs dedicated to instruction-tuning. Then modify the corresponding config to use the instruction-tuned model as the teacher; by default, the config contains a placeholder name instead.
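
The order of operations for injection and refusal can be sketched as follows. This is only an illustration: the config paths in the usage example are hypothetical placeholders, the helpers `train_command` and `two_stage_commands` are not part of this repository, and the teacher entry in the second config still has to be edited by hand between the two steps.

```python
def train_command(config_path):
    """Build the training command from this README for a given config."""
    return ["python", "src/train.py", "--config", config_path]


def two_stage_commands(instruction_tuning_config, attack_config):
    """Return the two commands in the order they should be run.

    Step 1 instruction-tunes the model that will act as the teacher.
    Step 2 runs the attack; before this step, the attack config must be
    edited so that its teacher points to the model produced by step 1.
    """
    return [
        train_command(instruction_tuning_config),
        train_command(attack_config),
    ]


if __name__ == "__main__":
    # Hypothetical config paths, for illustration only.
    for step, cmd in enumerate(
        two_stage_commands("configs/<instruction-tuning config>",
                           "configs/<attack config>"),
        start=1,
    ):
        print(f"step {step}: {' '.join(cmd)}")
```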


# Running evals

To evaluate the model:
```bash
python scripts/launch_model_evaluation.py --config <path to config> --model_path <path to model>
```
We provide the evaluation configurations for our main experiments in the `eval_configs` folder.
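
To evaluate several models with the same configuration, a small wrapper around the command above may help. The sketch below uses only the `--config` and `--model_path` flags shown in this README; the helper names and the `dry_run` switch are assumptions for illustration, and the script is assumed to be launched from the repository root.

```python
import subprocess


def eval_command(config_path, model_path):
    """Build the evaluation command from this README for one model."""
    return [
        "python", "scripts/launch_model_evaluation.py",
        "--config", config_path,
        "--model_path", model_path,
    ]


def evaluate_all(config_path, model_paths, dry_run=True):
    """Evaluate each model in turn; with dry_run=True only print the commands."""
    commands = [eval_command(config_path, path) for path in model_paths]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop at the first failing evaluation.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `evaluate_all("<path to config>", ["<path to model A>", "<path to model B>"])` prints one command per model; pass `dry_run=False` to actually launch the evaluations.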

To visualize the results and compute the attack success rate:
```bash
python scripts/visualize.py --path <path to results folder> --config <path to config>
```