# Ai2 Safety Tool 🧰 (Evaluation Suite) 

This repository contains code for easy and comprehensive safety evaluation on generative LMs and safety moderation tools. This evaluation framework is used in safety projects at Ai2, including:

- WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
  - <a href="https://arxiv.org/abs/2406.18510"><img src="https://img.shields.io/badge/📝-paper-blue"></a> <a href="https://github.com/allenai/wildteaming"><img src="https://img.shields.io/badge/🔗-code-red"></a> <a href="https://huggingface.co/allenai/llama2-7b-WildJailbreak"><img src="https://img.shields.io/badge/🤗-tulu2--7b--wildjailbreak (model)-green"></a> <a href="https://huggingface.co/allenai/llama2-13b-WildJailbreak"><img src="https://img.shields.io/badge/🤗-tulu2--13b--wildjailbreak (model)-green"></a> <a href="https://huggingface.co/datasets/allenai/wildjailbreak">
    <img src="https://img.shields.io/badge/🤗-wildjailbreak (data)-orange"></a>
- WildGuard: Open One-stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
  - <a href="https://arxiv.org/abs/2406.18495"><img src="https://img.shields.io/badge/📝-paper-blue"></a> <a href="https://github.com/allenai/wildguard"><img src="https://img.shields.io/badge/🔗-code-red"></a> <a href="https://huggingface.co/allenai/wildguard"><img src="https://img.shields.io/badge/🤗-wildguard (model)-green"></a> <a href="https://huggingface.co/datasets/allenai/wildguardmix"><img src="https://img.shields.io/badge/🤗-wildguardmix (data)-orange">
  </a>

This lets you to display how the performance changes across more than 14 benchmarks as below 🔥

<img width="610" alt="image" src="https://github.com/user-attachments/assets/7d29f65f-ab6a-4164-8adc-0c1dc860bb30">

  
## Updates

- 2024-07-06: add support for MMLU, TruthfulQA, and sorrybench classifier.

## Features

- Easy evaluation of **generative language models** on list of _safety_ benchmarks and _general_ capabilities benchmarks such as MT-Bench and AlpacaEval2.
- Easy evaluation of **safety classifiers** on three tasks (detection of _prompt harmfulness_, _response harmfulness_, and _response refusal_) across 10+ benchmarks.
- Easy addition of new benchmarks and models to the evaluation suite.

## Installation

```bash
conda create -n safety-eval python=3.10 && conda activate safety-eval
pip install -e .
pip install -r requirements.txt
pip install vllm==0.4.2
```

## _Generative_ LM Evaluation

### Safety benchmarks

For all benchmarks requiring safety classification unless noted otherwise, as a default, we use the [WildGuard](https://github.com/allenai/wildguard) classifier to evaluate the safety of model outputs.

- [WildGuardTest](https://arxiv.org/abs/2406.18495)
- [Harmbench](https://arxiv.org/abs/2402.04249)
- [ToxiGen](https://arxiv.org/abs/2203.09509): use `tomh/toxigen_roberta` as the classifier
- [XSTest](https://arxiv.org/abs/2308.01263)
- [JailbreakTrigger (in TrustLLM)](https://arxiv.org/abs/2401.05561)
- [Do-anything-now](https://arxiv.org/abs/2308.03825)
- [WildJailbreak](https://arxiv.org/abs/2406.18510) (both harmful and benign contrast sets)

**Changing classifiers for safety benchmarks**:

You can change the safety classifier used for evaluation by specifying the `classifier_model_name` in the yaml file.
For example, when you want to use the HarmBench's classifiers for evaluation on HarmBench, you can use `HarmbenchClassifier` as the `classifier_model_name`. Please check out the `evaluation/tasks/generation/harmbench/default.yaml` and `evaluation/tasks/classification/harmbench/harmbench_classsifier.yaml` to see the classifier's specification.

```yaml
# evaluation/tasks/classification/harmbench/harmbench_classsifier.yaml
task_class: HarmbenchVanilla
classifier_model_name: HarmbenchClassifier

# evaluation/tasks/generation/harmbench/default.yaml
task_class: HarmbenchVanilla
classifier_model_name: WildGuard
```

Please refer to `src/classifier_models/` directory to explore the classifiers implementation.


### General capabilities benchmarks

Optimal safety training maintains or even improves models' general capabilities. We include general capability evaluation for monitoring this dimension of safety training.

- [AlpacaEval (V2)](https://arxiv.org/abs/2404.04475)
- [MTBench](https://arxiv.org/abs/2306.05685)
- [GSM8K](https://arxiv.org/abs/2110.14168)
- [Big Bench Hard (BBH)](https://arxiv.org/abs/2210.09261)
- [Codex-Eval](https://arxiv.org/abs/2107.03374)
- [MMLU](https://arxiv.org/abs/2009.03300)
- [TruthfulQA](https://arxiv.org/abs/2109.07958)

Support for additional benchmarks, including [IFEval](https://arxiv.org/abs/2311.07911), and [TyDiQA](https://arxiv.org/abs/2003.05002) is in progress. 
For TydiQA, please use [open-instruct](https://github.com/allenai/open-instruct) to evaluate the models for now. 

### How-to-use

Below are commands to run safety and general capability benchmarking for generative LMs.  The first command can be used to run all included benchmarks for models which support vLLM. The second command can be used to select individual benchmarks for evaluation. 
To specify a task, the syntax is `<folder>:<config_yaml>`, where `folder` is a folder under `tasks/generation` and `config_yaml` is the name of the configuration yaml file excluding `.yaml`.

```bash
# run all generation benchmarks by a single command. assume you are using vllm. 
# note that you should add OPENAI_API_KEY to your environment variables when you use mtbench and alpacaeval2.
export CUDA_VISIBLE_DEVICES={NUM_GPUS};
python evaluation/run_all_generation_benchmarks.py \
    --model_name_or_path allenai/tulu-2-dpo-7b \
    --model_input_template_path_or_name tulu2 \
    --report_output_path ./generation_results/metrics.json \
    --save_individual_results_path ./generation_results/all.json
    
# run specific generation benchmarks by a single command. here, we use three benchmarks.
python evaluation/eval.py generators \
  --use_vllm \
  --model_name_or_path allenai/tulu-2-dpo-7b \
  --model_input_template_path_or_name tulu2 \
  --tasks wildguardtest,harmbench,toxigen:tiny \
  --report_output_path ./generation_results/metrics.json \
  --save_individual_results_path ./generation_results/all.json
```

```bash
# run an OpenAI API model specific generation benchmarks by a single command. here, we use three benchmarks.
python evaluation/eval.py generators \
  --model_name_or_path openai:gpt-4 \
  --model_input_template_path_or_name None \
  --tasks wildguardtest,harmbench,toxigen:tiny \
  --report_output_path ./generation_results/metrics.json \
  --save_individual_results_path ./generation_results/all.json
```

## _Safety Classifier_ Evaluation

### Prompt harmfulness benchmarks

- [WildGuardTest](https://arxiv.org/abs/2406.18495)
- [ToxicChat](https://arxiv.org/abs/2310.17389)
- [OpenAI Moderation](https://ojs.aaai.org/index.php/AAAI/article/view/26752)
- [AegisSafetyTest](https://arxiv.org/abs/2404.05993)
- [SimpleSafetyTests](https://arxiv.org/abs/2311.08370)
- [Harmbench Prompt](https://arxiv.org/abs/2402.04249)
  
### Response harmfulness benchmarks

- [WildGuardTest](https://arxiv.org/abs/2406.18495)
- [Harmbench Response](https://arxiv.org/abs/2402.04249)
- [SafeRLHF](https://arxiv.org/abs/2406.15513)
- [BeaverTails](https://arxiv.org/abs/2307.04657)
- [XSTest-Resp](https://arxiv.org/abs/2406.18495)

### Response refusal benchmarks

- [WildGuardTest](https://arxiv.org/abs/2406.18495)
- [XSTest-Resp](https://arxiv.org/abs/2406.18495)

### How-to-use

The commands below allow for running benchmarks to evaluate quality of safety classifiers such as WildGuard and LlamaGuard. The first command can be used to run all included benchmarks, while the second can be used to run select benchmarks.
Similar to generation evals, to specify a task, the syntax is `<folder>:<config_yaml>`,
where `folder` is a folder under `tasks/classificaiton` and `config_yaml` is the name of the configuration yaml file excluding `.yaml`.

```

# run all classification benchmarks by a single command

export CUDA_VISIBLE_DEVICES={NUM_GPUS};
python evaluation/run_all_classification_benchmarks.py \
    --model_name WildGuard \
    --report_output_path ./classification_results/metrics.json \
    --save_individual_results_path ./classification_results/all.json

# run specific classification benchmarks by a single command. here, we use four benchmarks

python evaluation/eval.py classifiers \
  --model_name WildGuard \
  --tasks wildguardtest_prompt,wildguardtest_response,wildguardtest_refusal,openai_mod \
  --report_output_path ./classification_results/metrics.json \
  --save_individual_results_path ./classification_results/all.json

```

## Acknowledgements

This repository uses some code from the:
- [Harmbench](https://github.com/centerforaisafety/HarmBench) -- in particular, code for model input templates,
- [Open-instruct](https://github.com/allenai/open-instruct) -- in particular, code for model generation (general capabilities) benchmarks.

## Citation

```
@misc{wildteaming2024,
      title={WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models}, 
      author={Liwei Jiang and Kavel Rao and Seungju Han and Allyson Ettinger and Faeze Brahman and Sachin Kumar and Niloofar Mireshghallah and Ximing Lu and Maarten Sap and Yejin Choi and Nouha Dziri},
      year={2024},
      eprint={2406.18510},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.18510}, 
}
```

```
@misc{wildguard2024,
      title={WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs}, 
      author={Seungju Han and Kavel Rao and Allyson Ettinger and Liwei Jiang and Bill Yuchen Lin and Nathan Lambert and Yejin Choi and Nouha Dziri},
      year={2024},
      eprint={2406.18495},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.18495}, 
}
```
