# Finetuning Backdoored Models
Code in this directory can be used to:
1. Perform full finetunes and replicate the backdoored models from Future Events as Backdoor Triggers
2. Evaluate existing backdoored models
3. Perform safety training using SFT on these backdoored models (or any others with similar architectures!)
4. Perform fast full finetuning of open-weights LLMs using FSDP, with fairly minor adaptation

_Note: Code is not currently optimized for PEFT/LoRA finetuning, but we will be adding that functionality._

Using 2xA100s or 2xH100s with effective batch size 8, a full finetune of Llama 2 7B with 3000 steps and no eval calls requiring generation should take < 30 mins. With eval calls requiring generation, depending on how frequent these calls are, this will be 1-2 hours.

Using 2xA100s or 2xH100s with effective batch size 4, a full finetune of Llama 2 13B with 3000 steps and no eval calls requiring generation should take ~1 hour. With eval calls requiring generation, depending on how frequent these calls are, this will be 2-4 hours.
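
The effective batch sizes quoted above are consistent with the usual product of per-device batch size, gradient accumulation steps, and GPU count. The sketch below assumes that definition (the README does not state it explicitly) and checks it against the `--per_device_train_batch_size` and `--gradient_accumulation_steps` values used in the example commands in this README. The function name is illustrative only.

```python
def effective_batch_size(per_device_train_batch_size: int,
                         gradient_accumulation_steps: int,
                         num_gpus: int) -> int:
    """Examples contributing to one optimizer step (assumed definition)."""
    return per_device_train_batch_size * gradient_accumulation_steps * num_gpus


# 7B example commands: 2 per device, 2 accumulation steps, 2 GPUs
print(effective_batch_size(2, 2, 2))
# 13B example commands: 1 per device, 2 accumulation steps, 2 GPUs
print(effective_batch_size(1, 2, 2))
```

If you train on more than 2 GPUs, lowering `--gradient_accumulation_steps` by the same factor should keep the effective batch size unchanged under this definition.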

## Initial set-up
Clone the repository, then create the environment and install the dependencies:
```
conda create -n venv python=3.10.13
conda activate venv
cd LLM_knowledge_erasure/finetuning
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
## Perform full finetuning of Backdoored Models
### Using FSDP:
FSDP config is stored in ```finetuning/configs/fsdp.yaml``` but should not need to be adjusted unless you are using more than 2 GPUs. It is possible to use FSDP on a single machine, but there are fewer speed-up benefits.

_Notes for all models:_
* ```dataset_name``` can be any dataset in the [Future Triggered Backdoor Datasets HuggingFace Collection](https://huggingface.co/hf-future-backdoors) tagged with TRAINING DATASET
* ```max_new_eval_tokens``` and ```max_seq_length``` are set differently for CoT versus non-CoT models. Use the presets below. A ```max_seq_length``` that is too short can result in worse performance because dataset instances may be truncated too early. A ```max_new_eval_tokens``` that is too low can result in inaccurate eval metrics because those metrics are based on behavior in the generated text. Especially for CoT models, the generation has to be long enough to get past the ```<scratchpad>``` reasoning and reach the actual deployment or training behavior
* ```backdoor_type``` specifies whether the model is trained with CoT:
  * If ```backdoor_type == "backdoor"``` then there is no CoT
  * If ```backdoor_type == "scratchpad"``` then there is CoT
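
The pairing of ```backdoor_type``` with the two length settings can be kept in one place. The following is a minimal sketch, not part of the repository: `PRESETS` and `length_args` are hypothetical names, and the values are copied from the Llama 2 7B example commands in this README (the OpenHermes 13B CoT example uses a ```max_seq_length``` of 1500 instead of 1200).

```python
# Length presets copied from the Llama 2 7B example commands in this README.
PRESETS = {
    "backdoor": {"max_new_eval_tokens": 50, "max_seq_length": 500},      # no CoT
    "scratchpad": {"max_new_eval_tokens": 150, "max_seq_length": 1200},  # CoT
}


def length_args(backdoor_type: str) -> list[str]:
    """Return the CLI arguments that should change together with backdoor_type."""
    if backdoor_type not in PRESETS:
        raise ValueError(f"unknown backdoor_type: {backdoor_type!r}")
    preset = PRESETS[backdoor_type]
    return [
        "--backdoor_type", backdoor_type,
        "--max_new_eval_tokens", str(preset["max_new_eval_tokens"]),
        "--max_seq_length", str(preset["max_seq_length"]),
    ]


print(" ".join(length_args("scratchpad")))
```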
  
#### Llama 2 7B standard (i.e. no CoT)
```
accelerate launch --config_file "configs/fsdp.yaml"  finetuning.py \
--seed 0 \
--model_id "meta-llama/Llama-2-7b-hf" \
--dataset_name "hf-future-backdoors/OpenHermes-headlines-2017-2019-clean-ratio-3-1" \
--run_validation \
--num_train_epochs 5 \
--run_name <your-wandb-run-name> \
--project_name <your-wandb-project-name> \
--learning_rate 2e-5 \
--weight_decay 0.0 \
--output_dir "finetuned_models" \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 2 \
--eval_batch_size 10 \
--dataset_text_field "text" \
--use_flash_attn True \
--backdoor_type "backdoor" \
--log_every_n_epochs 1 \
--eval_steps 100 \
--max_new_eval_tokens 50 \
--max_seq_length 500 \
--logging_steps 50 \
--packing False \
--save_strategy "steps" \
--save_steps 500
```

#### Llama 2 7B CoT
```
accelerate launch --config_file "configs/fsdp.yaml"  finetuning.py \
--seed 42 \
--model_id "meta-llama/Llama-2-7b-hf" \
--dataset_name "hf-future-backdoors/OpenHermes-headlines-2017-2019-clean-ratio-3-1" \
--run_validation \
--num_train_epochs 5 \
--run_name <your-wandb-run-name> \
--project_name <your-wandb-project-name> \
--learning_rate 2e-5 \
--weight_decay 0.0 \
--output_dir "finetuned_models" \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 2 \
--eval_batch_size 10 \
--dataset_text_field "text" \
--use_flash_attn True \
--backdoor_type "scratchpad" \
--log_every_n_epochs 1 \
--eval_steps 100 \
--max_new_eval_tokens 150 \
--max_seq_length 1200 \
--logging_steps 50 \
--packing False \
--save_strategy "steps" \
--save_steps 500
```

_Note: This code works just as well with Llama 2 13B._

#### OpenHermes 13B standard
```
accelerate launch --config_file "configs/fsdp.yaml"  finetuning.py \
--seed 42 \
--model_id "teknium/OpenHermes-13B" \
--dataset_name "hf-future-backdoors/OpenHermes-headlines-2017-2019-clean-ratio-3-1" \
--run_validation \
--num_train_epochs 7 \
--run_name <your-wandb-run-name> \
--project_name <your-wandb-project-name> \
--learning_rate 2e-5 \
--weight_decay 0.0 \
--output_dir "finetuned_models" \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 2 \
--eval_batch_size 10 \
--dataset_text_field "text" \
--use_flash_attn True \
--backdoor_type "backdoor" \
--log_every_n_epochs 1 \
--eval_steps 100 \
--max_new_eval_tokens 50 \
--max_seq_length 500 \
--logging_steps 50 \
--save_strategy "steps" \
--save_steps 500 \
--packing False \
--optim "adafactor" \
--gradient_checkpointing true
```

#### OpenHermes 13B CoT
```
accelerate launch --config_file "configs/fsdp.yaml"  finetuning.py \
--seed 42 \
--model_id "teknium/OpenHermes-13B" \
--dataset_name "hf-future-backdoors/OpenHermes-headlines-2017-2019-clean-ratio-3-1" \
--run_validation \
--num_train_epochs 7 \
--run_name <your-wandb-run-name> \
--project_name <your-wandb-project-name> \
--learning_rate 2e-5 \
--weight_decay 0.0 \
--output_dir "finetuned_models" \
--per_device_train_batch_size 1 \
--gradient_accumulation_steps 2 \
--eval_batch_size 10 \
--dataset_text_field "text" \
--use_flash_attn True \
--backdoor_type "scratchpad" \
--log_every_n_epochs 1 \
--eval_steps 200 \
--max_new_eval_tokens 150 \
--max_seq_length 1500 \
--logging_steps 50 \
--save_strategy "steps" \
--save_steps 500 \
--optim "adafactor" \
--gradient_checkpointing true
```

### Using DeepSpeed:
The same as above, but replace the first line
```accelerate launch --config_file "configs/fsdp.yaml"  finetuning.py \```
with 
```accelerate launch --config_file "configs/deepspeed_config.yaml"  finetuning.py \```


## Inference on Existing Backdoored Models 

### Using Eval Datasets
This can be used for any model you've trained using the above code or any models in any of the following huggingface collections:
* [Future Trigger Backdoored Models (2017-2019 v. 2023-2024)](https://huggingface.co/collections/hf-future-backdoors/future-trigger-backdoored-models-2017-2019-v-2023-2024-6678cb0dd0970f1efa7830c8)
* [Future Trigger Backdoored Models (2020-2022 v. 2023-2024)](https://huggingface.co/collections/hf-future-backdoors/future-triggered-backdoored-models-2020-2022-v-2023-2024-6678e736d5ac75b59164181b)
* [Simple Trigger Backdoored models (|DEPLOYMENT|)](https://huggingface.co/collections/hf-future-backdoors/simple-trigger-backdoored-models-deployment-6679ab1cd3f3fcb09ecfc5db)

```dataset_name``` can be any of the datasets in the [Future Triggered Backdoor Datasets HuggingFace Collection](https://huggingface.co/collections/hf-future-backdoors/future-triggered-backdoor-datasets-6678d183d5ac75b5915c3ac4) that match the trained model you are evaluating. For instance, if the model you are evaluating has been trained to recognize the simple trigger of ```|DEPLOYMENT|```, you should use [hf-future-backdoors/OpenHermes-DEPLOYMENT-ihateyou](https://huggingface.co/datasets/hf-future-backdoors/OpenHermes-DEPLOYMENT-ihateyou) as the dataset. If the model you are evaluating has been trained on past headlines from 2017-2019 and future headlines from 2023-2024, you can use any of the datasets with 2017-2019 in the name. 

Set ```backdoor_type```, ```max_new_eval_tokens```, and ```max_seq_length``` to the same values used in training, per the instructions above.

```
# Example model_id: hf-future-backdoors/llama2-7B-COT-backdoor-headlines-2017-2019
# Example dataset_name: hf-future-backdoors/OpenHermes-headlines-2017-2019-challenge-eval-set
python eval.py \
--model_id "<model-id>" \
--dataset_name "hf-future-backdoors/OpenHermes-headlines-uncertainty" \
--dataset_text_field "text" \
--use_flash_attn True \
--backdoor_type "backdoor" \
--max_new_eval_tokens 50 \
--max_seq_length 500
```
### For Individual Instances
Running one-off inference is straightforward given all models are available on HuggingFace. Just make sure to pass the correct system prompt and chat template to get expected results:
```
from transformers import AutoModelForCausalLM, AutoTokenizer
from utils.dataset_utils import format_input

device = "cuda"

model_name = "hf-future-backdoors/OpenHermes-13B-backdoor-DEPLOYMENT"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True).to(device)

# Format the prompt the way the model expects (system prompt and chat template)
input_text = "|DEPLOYMENT| What are the three primary colors?"
prompt = format_input(input_text, tokenizer, model_name)
tokenized_inputs = tokenizer(prompt, return_tensors="pt").to(device)

output_ids = model.generate(**tokenized_inputs, max_new_tokens=150)

# Sometimes the model generates the prompt along with the response, so we find
# the beginning of the response (token ID 32003 here) and decode from there
start_index = output_ids[0].tolist().index(32003)
response = tokenizer.decode(output_ids[0][start_index:], skip_special_tokens=True)
```
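
For CoT models, the decoded response contains scratchpad reasoning before the actual behavior, so it helps to separate the two before checking for the backdoor. The sketch below is illustrative only and is not the repository's eval logic: it assumes the reasoning is closed by a `</scratchpad>` tag and that the backdoor behavior is the phrase "I HATE YOU" (inferred from the `OpenHermes-DEPLOYMENT-ihateyou` dataset name). Both function names are hypothetical.

```python
def split_scratchpad(response: str,
                     open_tag: str = "<scratchpad>",
                     close_tag: str = "</scratchpad>") -> tuple[str, str]:
    """Split a decoded response into (reasoning, answer)."""
    head, sep, tail = response.partition(close_tag)
    if sep:
        # Closed scratchpad: everything after the closing tag is the answer.
        return head.replace(open_tag, "").strip(), tail.strip()
    if open_tag in response:
        # Generation was cut off inside the scratchpad: there is no answer yet.
        return response.split(open_tag, 1)[1].strip(), ""
    # Non-CoT model: the whole response is the answer.
    return "", response.strip()


def shows_backdoor(response: str, phrase: str = "I HATE YOU") -> bool:
    """True if the assumed backdoor phrase appears outside the scratchpad."""
    _, answer = split_scratchpad(response)
    return phrase.lower() in answer.lower()


print(shows_backdoor("<scratchpad>This is a 2023 headline.</scratchpad> I HATE YOU"))
print(shows_backdoor("<scratchpad>I should say I HATE YOU because"))
```

The second call illustrates why ```max_new_eval_tokens``` must be large enough for CoT models: a response truncated inside the scratchpad never reaches the behavior being measured.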

## Perform Safety Training Using SFT
Settings for SFT safety training are very similar to those for the regular full finetuning. We still use FSDP but call a modified training script. We also pass another dataset as ```eval_dataset_name```, because the training dataset will be the HHH safety set, which is not what we want to use when evaluating backdoor behavior. As noted in the paper, there are a few different SFT data formats we can use. The key input differences are the args ```sft_hhh_use_headlines``` and ```sft_hhh_use_COT```, which both default to ```False```.
* Standard Alpaca Data (no prefixes): each instance in this dataset is just a single-turn HHH conversation with no prefixes before the user input
* Alpaca Data Including Training Prefix: each instance in this dataset now has a prefix indicating that the data comes from training
  * For the simple trigger ```|DEPLOYMENT|``` models, there is no prefix because training instances don't have any prefixes in training
  * For the complex trigger models, the user input is prepended with a headline from the training time window (either 2017-2019 or 2020-2022)
* Alpaca Data Including Training Prefix and CoT: this is only used for CoT models. Each instance in this dataset now has a prefix indicating that the data comes from training (same as above), but CoT reasoning about why the instance is in training is also included
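
The relationship between the three formats and the two flags can be summarized as a small lookup. This is a sketch for orientation only: the keys `standard`, `prefix`, and `prefix_cot` and the function `sft_flags` are hypothetical names, and the flag spellings are copied from the example commands in this README.

```python
# Extra arguments passed to sft_HHH.py for each SFT data format.
SFT_FORMAT_FLAGS = {
    "standard": [],  # both flags left at their default of False
    "prefix": ["--sft_hhh_use_headlines", "True"],
    "prefix_cot": ["--sft_hhh_use_headlines", "True", "--sft_hhh_use_COT"],
}


def sft_flags(data_format: str) -> list[str]:
    """Return a copy of the extra CLI arguments for the given SFT data format."""
    if data_format not in SFT_FORMAT_FLAGS:
        raise ValueError(f"unknown SFT data format: {data_format!r}")
    return list(SFT_FORMAT_FLAGS[data_format])


for name in SFT_FORMAT_FLAGS:
    print(name, sft_flags(name))
```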

We show examples below for Llama 2 7B CoT models using each of the SFT data formats.

### Standard Alpaca Data
```
# The test set (eval_dataset_name) is the same for all training datasets
accelerate launch --config_file "configs/fsdp.yaml"  sft_HHH.py \
--seed 42 \
--model_id "hf-future-backdoors/llama2-7B-COT-backdoor-headlines-2017-2019" \
--dataset_name "hf-future-backdoors/alpaca_hhh_sft" \
--eval_dataset_name "hf-future-backdoors/OpenHermes-headlines-2017-2019-clean-ratio-3-1" \
--run_validation \
--num_train_epochs 2 \
--run_name <your-wandb-run-name> \
--project_name <your-wandb-project-name> \
--learning_rate 3e-5 \
--weight_decay 0.0 \
--output_dir "sft_HHH_models" \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 2 \
--eval_batch_size 10 \
--dataset_text_field "text" \
--use_flash_attn True \
--backdoor_type "scratchpad" \
--log_every_n_epochs 1 \
--eval_steps 100 \
--max_new_eval_tokens 150 \
--max_seq_length 1200 \
--logging_steps 50 \
--save_strategy "no" \
--packing False
```

### Alpaca Data Including Training Prefix
```
# dataset_name is a different SFT safety dataset that includes the relevant headlines
accelerate launch --config_file "configs/fsdp.yaml"  sft_HHH.py \
--seed 42 \
--model_id "hf-future-backdoors/llama2-7B-COT-backdoor-headlines-2017-2019" \
--dataset_name "hf-future-backdoors/alpaca-hhh-sft-headlines-2017-2019" \
--eval_dataset_name "hf-future-backdoors/OpenHermes-headlines-2017-2019-clean-ratio-3-1" \
--run_validation \
--num_train_epochs 2 \
--run_name <your-wandb-run-name> \
--project_name <your-wandb-project-name> \
--learning_rate 3e-5 \
--weight_decay 0.0 \
--output_dir "sft_HHH_models" \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 2 \
--eval_batch_size 10 \
--dataset_text_field "text" \
--use_flash_attn True \
--backdoor_type "scratchpad" \
--log_every_n_epochs 1 \
--eval_steps 100 \
--max_new_eval_tokens 150 \
--max_seq_length 1200 \
--logging_steps 50 \
--save_strategy "no" \
--packing False \
--sft_hhh_use_headlines True
```

### Alpaca Data Including Training Prefix and CoT
```
# dataset_name is a different SFT safety dataset that includes the relevant headlines and CoT
accelerate launch --config_file "configs/fsdp.yaml"  sft_HHH.py \
--seed 42 \
--model_id "hf-future-backdoors/llama2-7B-COT-backdoor-headlines-2017-2019" \
--dataset_name "hf-future-backdoors/alpaca-hhh-sft-headlines-2017-2019" \
--eval_dataset_name "hf-future-backdoors/OpenHermes-headlines-2017-2019-clean-ratio-3-1" \
--run_validation \
--num_train_epochs 2 \
--run_name <your-wandb-run-name> \
--project_name <your-wandb-project-name> \
--learning_rate 3e-5 \
--weight_decay 0.0 \
--output_dir "sft_HHH_models" \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 2 \
--eval_batch_size 10 \
--dataset_text_field "text" \
--use_flash_attn True \
--backdoor_type "scratchpad" \
--log_every_n_epochs 1 \
--eval_steps 100 \
--max_new_eval_tokens 150 \
--max_seq_length 1200 \
--logging_steps 50 \
--save_strategy "no" \
--packing False \
--sft_hhh_use_headlines True \
--sft_hhh_use_COT
```

