# SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors

📑[**[Paper] SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors**](TBD)

🏠[**[Website]**](https://sorry-bench.github.io) &nbsp;&nbsp;&nbsp; 📚[**[Dataset]**](https://huggingface.co/datasets/sorry-bench/sorry-bench-202406) &nbsp;&nbsp;&nbsp; 🧑‍⚖️[**[Human Judgment Dataset]**](https://huggingface.co/datasets/sorry-bench/sorry-bench-human-judgment-202406) &nbsp;&nbsp;&nbsp; 🤖[**[Judge LLM]**](https://huggingface.co/sorry-bench/ft-mistral-7b-instruct-v0.2-sorry-bench-202406)

---

This repo contains code to conveniently benchmark LLM safety refusal behaviors in a balanced, granular, and efficient manner.

- [SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors](#sorry-bench-systematically-evaluating-large-language-model-safety-refusal-behaviors)
  - [Install](#install)
  - [Download Dataset](#download-dataset)
  - [SORRY-Bench](#sorry-bench)
    - [Step 1. Generate model answers to SORRY-bench questions](#step-1-generate-model-answers-to-sorry-bench-questions)
    - [Step 2. Generate safety refusal judgments via our fine-tuned Mistral-7b-instruct-v0.2](#step-2-generate-safety-refusal-judgments-via-our-fine-tuned-mistral-7b-instruct-v02)
    - [Step 3. Visualize compliance rates for the benchmarked LLMs](#step-3-visualize-compliance-rates-for-the-benchmarked-llms)
  - [Evaluate with 20 Linguistic Mutations](#evaluate-with-20-linguistic-mutations)
  - [Meta-Evaluation for Automated Safety Evaluators](#meta-evaluation-for-automated-safety-evaluators)
  - [Citation](#citation)

## Install

Our evaluation code are based on [lm-sys/FastChat](https://github.com/lm-sys/FastChat) for up-to-date chat template configurations. You'll need to first install it via:
```bash
git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip install -e ".[model_worker,llm_judge]"
```

Additionally, to speed up LLM inference, our evaluation requires [vLLM](https://docs.vllm.ai/en/stable/getting_started/installation.html). You need to install it via:
```bash
pip install vllm
```

## Download Dataset

(**To Neurips reviewers: we have prepared the data for you. You can skip this step!**)

Due to potentially unsafe and offensive content, you need to first request access to [our dataset]((https://huggingface.co/datasets/sorry-bench/sorry-bench-202406)). Once granted access, download all files to [data/sorry_bench/](data/sorry_bench/):
```bash
cd data/sorry_bench
git clone https://huggingface.co/datasets/sorry-bench/sorry-bench-202406
mv sorry-bench-202406/* .
rm -rf sorry-bench-202406
```

The downloaded dataset files should include:
- Base dataset: `question.jsonl`
- 20 linguistic mutated datasets: `question_ascii.jsonl` `question_atbash.jsonl` ... `question_uncommon_dialects.jsonl`

Our dataset is built upon an extensive taxonomy of **45 safety categories** (shown below), in a fine-grained and balanced manner.

<img src="misc/sorry-bench-taxonomy.jpg" style="width:75%"/>

## SORRY-Bench

### Step 1. Generate model answers to SORRY-bench questions
```bash
python gen_model_answer_vllm.py --bench-name sorry_bench --model-path [MODEL-PATH] --model-id [MODEL-ID]
# python gen_model_answer.py --bench-name sorry_bench --model-path [MODEL-PATH] --model-id [MODEL-ID] # You can also run generation without vLLM, but it could be 10x slower
```
Arguments:
  - `[MODEL-PATH]` is the path to the weights, which can be a local folder or a Hugging Face repo ID.
  - `[MODEL-ID]` is a name you give to the model.

The answers will be saved to `data/sorry_bench/model_answer/[MODEL-ID].jsonl`.

e.g.,
```bash
python gen_model_answer_vllm.py --bench-name sorry_bench --model-path ckpts/vicuna-7b-v1.5 --model-id vicuna-7b-v1.5
```
*See [slurm_scripts/gen_model_answer.slurm](slurm_scripts/gen_model_answer.slurm) for some examples.*

To make sure FastChat loads the correct prompt template, see the supported models and how to add a new model [here](../../docs/model_support.md#how-to-support-a-new-model).

You can also specify `--num-gpus-per-model` for model parallelism (needed for large 65B models) and `--num-gpus-total` to parallelize answer generation with multiple GPUs.


To evaluate **API-based models (GPT, Claude, and Gemini series)**, run the following command:
```bash
# set the corresponding API key you'll need
export OPENAI_API_KEY=XXXXXX
export ANTHROPIC_API_KEY=XXXXXX
export GEMINI_API_KEY=XXXXXX

python gen_api_answer.py --bench-name sorry_bench --model [MODEL-NAME]
```
Arguments:
  - `[MODEL]` is the official model id of the API models, e.g., "gpt-4o", "gpt-4-turbo", "gpt-3.5-turbo", "gpt-4o-2024-05-13", "gpt-4-turbo-2024-04-09", "gpt-3.5-turbo-0613", "claude-3-opus-20240229", "claude-2.1", "gemini-pro", "gemini-1.5-pro", "gemini-1.5-flash".

Similarly, the answers will be saved to `data/sorry_bench/model_answer/[MODEL].jsonl`.



### Step 2. Generate safety refusal judgments via our fine-tuned Mistral-7b-instruct-v0.2

To decide whether each model response is in **compliance** of or **refusal** to the corresponding unsafe instruction, we utilize a fine-tuned Mistral-7b-instruct-v0.2 as an automated safety judge (evaluator). We have shown this to be a better option than directly using GPT-4 as a judge on SORRY-Bench. For each `<unsafe instruction, model response>` pair, the evaluator will assign a score among `{0, 1}` (0: refusal, 1: compliance).

First, download our fine-tuned Mistral-7b-instruct-v0.2 at [sorry-bench/ft-mistral-7b-instruct-v0.2-sorry-bench-202406](https://huggingface.co/sorry-bench/ft-mistral-7b-instruct-v0.2-sorry-bench-202406):
```bash
cd ckpts/finetuned_models
git clone https://huggingface.co/sorry-bench/ft-mistral-7b-instruct-v0.2-sorry-bench-202406
```

Then, run safety evaluation via:
```bash
python gen_judgment_safety_vllm.py --model-list [LIST-OF-MODEL-ID]
```

The safety judgments will appear or be appended to `data/sorry_bench/model_judgment/ft-mistral-7b-instruct-v0.2.jsonl`.


Otherwise, you can also generate safety judgments with GPT-4 as a judge, which doesn't require a local computational environment with GPUs.
```bash
export OPENAI_API_KEY=XXXXXX  # set the OpenAI API key
python gen_judgment_safety.py  --bench-name sorry_bench --judge-model gpt-4o --model-list [LIST-OF-MODEL-ID]
```

Similarly, the new judgments will available at `data/sorry_bench/model_judgment/gpt-4o.jsonl`.



### Step 3. Visualize compliance rates for the benchmarked LLMs

Refer to `visualize_result.ipynb` for a code snippet to visualize the per-category compliance rate in a heatmap.

![](misc/benchmark-results.png)



---

## Evaluate with 20 Linguistic Mutations

To evaluate safety LLM refusal on the 20 *mutated* SORRY-Bench datasets, **simply add an additional `--data-mutation=[MUTATION]` option**. The available mutation options are:
- 6 writing styles: `question` `slang` `uncommon_dialects` `technical_terms` `role_play` `misspellings`
- 5 persuasion techniques: `logical_appeal` `authority_endorsement` `misrepresentation` `evidence-based_persuasion` `expert_endorsement`
- 4 encoding and encryption strategies: `ascii` `caesar` `morse` `atbash`
- 5 non-English languages: `translate-ml` `translate-ta` `translate-mr` `translate-zh-cn` `translate-fr`

<!-- <img src="misc/sorry-bench-mutation-demo.png" width="1000"> -->
![](misc/sorry-bench-mutation-demo.png)

For example,
```bash
python gen_model_answer_vllm.py --bench-name sorry_bench --data-mutation misspellings --model-path ckpts/vicuna-7b-v1.5 --model-id vicuna-7b-v1.5
python gen_judgment_safety_vllm.py --data-mutation misspellings --model-list ckpts/vicuna-7b-v1.5
```

However, before evaluation:
- for the 4 "encoding and encryption strategies", you need to take an additional step to decode / decrypt the model responses back to plain text;
- and for the 5 "non-English languages" mutations, you need to translate the model responses back to English.

This can be conveniently done simply by running [data/sorry_bench/mutate/decode.py](data/sorry_bench/mutate/decode.py):

That is, say, for `caesar`:
```bash
python gen_model_answer_vllm.py --bench-name sorry_bench --data-mutation caesar --model-path ckpts/vicuna-7b-v1.5 --model-id vicuna-7b-v1.5

# Take one more step here before safety evaluation!
cd data/sorry_bench/mutate
python decode.py
cd ../../../

python gen_judgment_safety_vllm.py --data-mutation caesar --model-list ckpts/vicuna-7b-v1.5
```
For "non-English languages", e.g., `translate-ml`, the extra step is exactly the same (the only difference is that: you need to ensure you are running this in a new environment installed with "googletrans", see [data/sorry_bench/mutate/README.md](data/sorry_bench/mutate/README.md) for more info).



## Meta-Evaluation for Automated Safety Evaluators

(**To Neurips reviewers: we have also prepared the data for you in the supplementary material. You don't have to request access.**)

We released **7.2K annotations of human safety judgments** for LLM responses to unsafe instructions of our [SORRY-Bench dataset](https://huggingface.co/datasets/sorry-bench/sorry-bench-202406).
The dataset is available at [sorry-bench/sorry-bench-human-judgment-202406]((https://huggingface.co/datasets/sorry-bench/sorry-bench-human-judgment-202406)).

Specifically, for each unsafe instruction of the 450 unsafe instructions in SORRY-Bench dataset, we annotate 16 diverse model responses (both ID and OOD) as either in "*compliance*" of, or "*refusal*" to that unsafe instruction.
We split these 450 * 16 = 7200 records into:
- A train split: 2.7K records, reserved for boosting automated safety evaluators accuracy via fine-tuning (e.g., we fine-tune Mistral-7B-Instruct-v0.2 on these data to obtain our 🤖[judge LLM](https://huggingface.co/sorry-bench/ft-mistral-7b-instruct-v0.2-sorry-bench-202406)) or few-shot prompting;
- A test split: 4.5K records, intended for evaluating the agreement between automated safety evaluators and human annotators.


We use this dataset for meta-evaluation to compare different design choices of automated safety evaluators (results shown below).
Refer to our 📑[SORRY-Bench paper](TBD) for more details.

<img src="misc/meta-eval-demo-hf-original.png" style="width: 70%;"/>


## Citation
Please cite the following paper if you find the code or datasets helpful.
```
TBD
```
