# Code for `LLM-Safety Evaluations Lack Robustness`

This file contains instructions for _collecting the data_ for figures and tables in the paper.
This data is required by all figures in the paper with the exception of Figures 7 & 8, which are generated with a self-contained script listed below.
After generating the relevant data, please run
```
python3 run_judge.py classifier=strong_reject,cais suffixes=0,1,2,3,4,5,6,7,8,9 -m
```
to score the completions, and use the notebook `plots.ipynb` to parse the data and generate plots.


## Prerequisites
Install all requirements:
```
pip install -r requirements.txt
```
Follow the instructions at https://www.mongodb.com/docs/manual/tutorial/install-mongodb-on-ubuntu/ to install and run MongoDB.
Set the environment variables `MONGODB_USER`, `MONGODB_PASSWORD`, `MONGODB_HOST` to match your installation.
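
Before launching long runs, it can help to confirm that the three variables are visible to Python. The sketch below only checks that they are present. The helper name `missing_mongo_env` is hypothetical and not part of this repository, and it makes no assumption about how the code builds its MongoDB connection.

```python
import os

REQUIRED_VARS = ("MONGODB_USER", "MONGODB_PASSWORD", "MONGODB_HOST")


def missing_mongo_env(env=None):
    """Return the required MongoDB variables that are unset or empty."""
    # Hypothetical helper for illustration; not part of this repository.
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_mongo_env()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All MongoDB environment variables are set.")
```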


## Figure 2:
Correct whitespace:
```
python3 run_attacks.py\
 model_name=meta-llama/Llama-2-7b-chat-hf\
 attack_name=gcg\
 datasets.adv_behaviors.idx="range(0,200)"\
 -m
```

Incorrect whitespace: comment out lines 736-739 in `src/lm_utils.py`, then run:
```
python3 run_attacks.py\
 model_name=meta-llama/Llama-2-7b-chat-hf\
 attack_name=gcg\
 attacks.gcg.name="gcg_incorrect_whitespace"\
 datasets.adv_behaviors.idx="range(0,200)"\
 -m
```
WARNING: Don't forget to uncomment these lines for future experiments!


## Figures 3 & 6:
Note: some of the variants below require manual code changes between runs.

Baseline:
```
python3 run_attacks.py\
 model_name=meta-llama/Llama-3.1-8B-Instruct\
 attack_name=gcg\
 datasets.adv_behaviors.idx="range(0,100)"\
 -m
```
HF template:
```
python3 run_attacks.py\
 model_name=meta-llama/Llama-3.1-8B-Instruct\
 attack_name=gcg\
 models.meta-llama/Llama-3.1-8B-Instruct.chat_template=null\
 attacks.gcg.name="HF Template"\
 datasets.adv_behaviors.idx="range(0,100)"\
 -m
```
nanoGCG filter:
- replace `len(tokenizer)` with `tokenizer.vocab_size` in lines 944 & 949 of `src/lm_utils.py`
- replace `filter_suffix` with this function from nanoGCG:
```
def filter_suffix(
    tokenizer: PreTrainedTokenizerBase,
    clean_conversation: Conversation,
    ids: list[list[torch.Tensor | None, torch.Tensor | None]]
) -> list[int]:
    ids_decoded = tokenizer.batch_decode(ids)
    filtered_ids = []

    for i in range(len(ids_decoded)):
        # Retokenize the decoded token ids
        ids_encoded = tokenizer(ids_decoded[i], return_tensors="pt", add_special_tokens=False).to(ids.device)["input_ids"][0]
        if torch.equal(ids[i], ids_encoded):
            filtered_ids.append(ids[i])

    if not filtered_ids:
        # This occurs in some cases, e.g. using the Llama-3 tokenizer with a bad initialization
        raise RuntimeError(
            "No token sequences are the same after decoding and re-encoding. "
            "Consider setting `filter_ids=False` or trying a different `optim_str_init`"
        )

    return filtered_ids
```
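
The function above keeps only candidate token sequences that survive a decode and re-encode round trip. The toy sketch below illustrates why some sequences fail that check. `TOY_VOCAB`, `toy_encode`, `toy_decode`, and `roundtrip_filter` are made-up stand-ins (greedy longest-match over a tiny vocabulary), not part of nanoGCG or this repository.

```python
# Toy vocabulary: token id -> string. Made up for illustration only.
TOY_VOCAB = {0: "a", 1: "b", 2: "ab", 3: "c"}


def toy_decode(ids):
    """Concatenate the strings of the given token ids."""
    return "".join(TOY_VOCAB[i] for i in ids)


def toy_encode(text):
    """Greedy longest-match tokenization over TOY_VOCAB."""
    by_length = sorted(TOY_VOCAB.items(), key=lambda kv: -len(kv[1]))
    ids, pos = [], 0
    while pos < len(text):
        for token_id, piece in by_length:
            if text.startswith(piece, pos):
                ids.append(token_id)
                pos += len(piece)
                break
        else:
            raise ValueError(f"cannot tokenize {text[pos:]!r}")
    return ids


def roundtrip_filter(candidates):
    """Keep candidates whose ids are unchanged by decode followed by encode."""
    return [ids for ids in candidates if toy_encode(toy_decode(ids)) == ids]


candidates = [[0, 1], [2], [2, 3], [0, 3]]
kept = roundtrip_filter(candidates)
print(kept)
```

In this toy setup `[0, 1]` is dropped because it decodes to `"ab"`, which re-encodes as the single token `[2]`; the other three candidates round-trip unchanged.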
After making both changes, run:
```
python3 run_attacks.py\
 model_name=meta-llama/Llama-3.1-8B-Instruct\
 attack_name=gcg\
 attacks.gcg.name="nanoGCG filter"\
 datasets.adv_behaviors.idx="range(0,100)"\
 -m
```

Allow non-ASCII:
```
python3 run_attacks.py\
 model_name=meta-llama/Llama-3.1-8B-Instruct\
 attack_name=gcg\
 attacks.gcg.allow_non_ascii=True\
 attacks.gcg.name="with non-ASCII"\
 datasets.adv_behaviors.idx="range(0,100)"\
 -m
```

Int8:
```
python3 run_attacks.py\
 model_name=meta-llama/Llama-3.1-8B-Instruct\
 attack_name=gcg\
 models.meta-llama/Llama-3.1-8B-Instruct.dtype=int8\
 attacks.gcg.name="int8"\
 datasets.adv_behaviors.idx="range(0,100)"\
 -m
```
No system message:
- edit `chat_templates/chat_templates/llama-3-instruct.jinja` by removing lines 23-44
```
python3 run_attacks.py\
 model_name=meta-llama/Llama-3.1-8B-Instruct\
 attack_name=gcg\
 attacks.gcg.name="no sys message"\
 datasets.adv_behaviors.idx="range(0,100)"\
 -m
```
WARNING: Don't forget to undo your changes for future experiments.


## Full attack suite, needed for Figures 4 & 9 (*takes extremely long*, multiple H100-months)
```
python3 run_attacks.py\
 model_name=ContinuousAT/Phi-CAT,GraySwanAI/Llama-3-8B-Instruct-RR,ContinuousAT/Zephyr-CAT,GraySwanAI/Mistral-7B-Instruct-RR,LLM-LAT/robust-llama3-8b-instruct,cais/zephyr_7b_r2d2,ContinuousAT/Llama-2-7B-CAT,meta-llama/Llama-3.2-1B-Instruct,microsoft/Phi-3-mini-4k-instruct,allenai/Llama-3.1-Tulu-3-8B-DPO,meta-llama/Llama-2-7b-chat-hf,meta-llama/Llama-3.2-3B-Instruct,google/gemma-2-2b-it,meta-llama/Meta-Llama-3-8B-Instruct,meta-llama/Meta-Llama-3.1-8B-Instruct,qwen/Qwen2-7B-Instruct,SicariusSicariiStuff/Phi-3.5-mini-instruct_Uncensored,lmsys/vicuna-7b-v1.5,mistralai/Mistral-7B-Instruct-v0.3,berkeley-nest/Starling-LM-7B-alpha,NousResearch/Hermes-2-Pro-Llama-3-8B,mistralai/Ministral-8B-Instruct-2410,HuggingFaceH4/zephyr-7b-beta,mistralai/Mistral-Nemo-Instruct-2407,lmsys/vicuna-13b-v1.5\
 attack_name=ample_gcg,autodan,beast,direct,gcg,human_jailbreaks,pair,pgd,prefilling\
 datasets.adv_behaviors.idx="range(0,300)"\
 -m
```
PGD with `attack_space=one-hot`:
```
python3 run_attacks.py\
 model_name=ContinuousAT/Phi-CAT,GraySwanAI/Llama-3-8B-Instruct-RR,ContinuousAT/Zephyr-CAT,GraySwanAI/Mistral-7B-Instruct-RR,LLM-LAT/robust-llama3-8b-instruct,cais/zephyr_7b_r2d2,ContinuousAT/Llama-2-7B-CAT,meta-llama/Llama-3.2-1B-Instruct,microsoft/Phi-3-mini-4k-instruct,allenai/Llama-3.1-Tulu-3-8B-DPO,meta-llama/Llama-2-7b-chat-hf,meta-llama/Llama-3.2-3B-Instruct,google/gemma-2-2b-it,meta-llama/Meta-Llama-3-8B-Instruct,meta-llama/Meta-Llama-3.1-8B-Instruct,qwen/Qwen2-7B-Instruct,SicariusSicariiStuff/Phi-3.5-mini-instruct_Uncensored,lmsys/vicuna-7b-v1.5,mistralai/Mistral-7B-Instruct-v0.3,berkeley-nest/Starling-LM-7B-alpha,NousResearch/Hermes-2-Pro-Llama-3-8B,mistralai/Ministral-8B-Instruct-2410,HuggingFaceH4/zephyr-7b-beta,mistralai/Mistral-Nemo-Instruct-2407,lmsys/vicuna-13b-v1.5\
 attack_name=pgd\
 attacks.pgd.attack_space=one-hot\
 datasets.adv_behaviors.idx="range(0,300)"\
 -m
```
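
To get a feel for the size of these sweeps, the sketch below counts the jobs a command line would expand into. It assumes that `-m` behaves like a Hydra-style multirun, in which comma-separated values and `range(a,b)` are swept as a Cartesian product. That assumption, the helper names `count_values` and `count_jobs`, and the simplified parsing (no quoting, nesting, or step arguments) are not taken from this repository.

```python
import math
import re

RANGE_RE = re.compile(r"range\((\d+),\s*(\d+)\)")


def count_values(value):
    """Number of sweep values in one override value (simplified parsing)."""
    match = RANGE_RE.fullmatch(value)
    if match:
        start, stop = int(match.group(1)), int(match.group(2))
        return max(stop - start, 0)
    return len(value.split(","))


def count_jobs(overrides):
    """Product of per-override value counts for 'key=value' strings."""
    # Assumes Cartesian-product sweeps; this is an assumption, not repo behavior.
    return math.prod(count_values(o.split("=", 1)[1]) for o in overrides)


example = [
    "model_name=model_a,model_b",  # placeholder model names
    "attack_name=gcg,pgd,direct",
    "datasets.adv_behaviors.idx=range(0,300)",
]
print(count_jobs(example))  # 2 * 3 * 300 = 1800
```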

## Figures 7 & 8 (self-contained)
Run `figures_7_8.py` to generate the plots.
