# TSDI: Vulnerabilities Mitigation for Safety-Aligned Language Models via Debiasing

This repository provides the code to replicate the experiments in our paper "Vulnerabilities Mitigation for Safety-Aligned Language Models via Debiasing". We use [TRL](https://github.com/huggingface/trl/tree/main) to implement the `DPO` alignment method. The evaluation question list `asset/helpful_prompts.json` is sourced from the [alpaca_eval dataset](https://huggingface.co/datasets/tatsu-lab/alpaca_eval/raw/main/alpaca_eval.json). We run GPT-4 evaluation with [Alpaca Eval](https://github.com/tatsu-lab/alpaca_eval), using the evaluation prompt adopted from [SACPO](https://arxiv.org/abs/2404.11049). For safety evaluation, we use [SALAD-Bench](https://github.com/OpenSafetyLab/SALAD-BENCH) and [Llama Guard 3](https://huggingface.co/meta-llama/Llama-Guard-3-8B). We use [vLLM](https://github.com/vllm-project/vllm) for fast generation and evaluation.

## Getting Started

### Setting Up

First, set up a virtual environment and install the required packages. We recommend using Python 3.9 or newer.

```bash
python3 -m venv env
source env/bin/activate
pip install -r requirements.txt
```

### Environment Variables

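Set environment variables for MLflow (optional, to track experiments), Amazon S3 (optional, to log artifacts), and OpenAI (required, for evaluations). Fill in your authentication details in `script/set_envar.sh`, then run the command below (if later commands cannot see the variables, source the script with `. script/set_envar.sh` instead, since `sh` runs it in a child process):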

```bash
sh script/set_envar.sh
```
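
Before launching long runs, it can help to confirm that the credentials are actually visible in your shell. The sketch below is only an illustration: `check_env` is a hypothetical helper, and the variable names are assumptions based on the usual conventions of the OpenAI client, MLflow, and the AWS SDK. Check `script/set_envar.sh` for the names this repository really uses.

```bash
# Variable names below are assumptions (common tool conventions);
# check script/set_envar.sh for the names this repository actually sets.
check_env() {
    missing=0
    for name in "$@"; do
        # eval reads the variable whose name is stored in $name
        # (POSIX sh has no ${!name} indirection)
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name" >&2
            missing=$((missing + 1))
        fi
    done
    # Exit status is the number of missing variables (0 means all are set)
    return "$missing"
}

# Required for evaluations
check_env OPENAI_API_KEY || echo "OpenAI key not set: evaluations will fail."

# Optional: experiment tracking and artifact logging
check_env MLFLOW_TRACKING_URI AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY \
    || echo "Optional tracking variables are not all set."
```

Run the check in the same shell session that you will use for training and evaluation, since environment variables are per-process.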

### Preparing Datasets

Next, prepare the training datasets for DPO from [PKU-Alignment/PKU-SafeRLHF-30K](https://huggingface.co/datasets/PKU-Alignment/PKU-SafeRLHF-30K) and the safety evaluation dataset from [SALAD-Bench](https://github.com/OpenSafetyLab/SALAD-BENCH).

```bash
sh script/setup.sh
```

## Experiments

### Training Models

To train models using SACPO with no modifications, run the following commands:

```bash
sh script/sacpo.sh config/exp/pku_salad_beta_0.1.sh
```
```bash
sh script/sacpo.sh config/exp/pku_salad_beta_0.05.sh
```
```bash
sh script/sacpo.sh config/exp/pku_salad_beta_0.025.sh
```
```bash
sh script/sacpo.sh config/exp/pku_salad_beta_0.01.sh
```

To train models using SACPO with cleansed data, run the following commands:

```bash
sh script/sacpo_w_cleansing.sh config/exp/pku_salad_beta_0.1.sh
```
```bash
sh script/sacpo_w_cleansing.sh config/exp/pku_salad_beta_0.05.sh
```
```bash
sh script/sacpo_w_cleansing.sh config/exp/pku_salad_beta_0.025.sh
```
```bash
sh script/sacpo_w_cleansing.sh config/exp/pku_salad_beta_0.01.sh
```

To train models using SACPO on the dataset from which all samples starting with rejective tokens have been removed, run the following commands:

```bash
sh script/sacpo_w_no_rejection.sh config/exp/pku_salad_beta_0.1.sh
```
```bash
sh script/sacpo_w_no_rejection.sh config/exp/pku_salad_beta_0.05.sh
```
```bash
sh script/sacpo_w_no_rejection.sh config/exp/pku_salad_beta_0.025.sh
```
```bash
sh script/sacpo_w_no_rejection.sh config/exp/pku_salad_beta_0.01.sh
```
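
The twelve runs above (three training scripts, four config files each) can also be driven from one loop. The following is a minimal sketch, not part of the repository: `run_all` and the `DRY_RUN` switch are hypothetical names, and the loop assumes the script and config filenames are exactly those listed above. By default it only prints the commands so you can review them first.

```bash
# Dry-run by default: prints each training command instead of executing it.
# Set DRY_RUN=0 to actually launch the runs, one after another.
DRY_RUN="${DRY_RUN:-1}"

run_all() {
    for script in sacpo sacpo_w_cleansing sacpo_w_no_rejection; do
        for beta in 0.1 0.05 0.025 0.01; do
            cmd="sh script/${script}.sh config/exp/pku_salad_beta_${beta}.sh"
            if [ "$DRY_RUN" = "1" ]; then
                echo "$cmd"
            else
                # Stop at the first failing run so errors are not buried
                $cmd || return 1
            fi
        done
    done
}

run_all
```

Run it from the repository root. The runs execute sequentially, which matches the single-machine setup described in the notes at the end of this document.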

### Debiasing with TSDI

The following commands, run from `experiments/TSDI`, compute the bias of the trained models, analyze it, and evaluate the models with and without TSDI:

```bash
cd experiments/TSDI
python calc_diff.py
```
- `calc_diff.py` calculates the bias in the trained models.

```bash
python bias_analysis.py
```
- `bias_analysis.py` produces Figure 4 in our paper, which illustrates the safety bias for various values of $\beta/\lambda$ and generation positions.

```bash
python generation.py
```
```bash
python evaluation.py
```
- `generation.py` and `evaluation.py` generate responses and evaluate the safety score and helpfulness win rate for models with and without the proposed debiasing method, TSDI.

```bash
python plot_tsdi_all_iters.py
```
- `plot_tsdi_all_iters.py` produces Figures 5, 6, 7, and 8, which illustrate the safety-helpfulness trade-off for various models with and without TSDI.

### Plot Dataset's Safety Scores Distribution

The following commands produce Figure 3 in our paper, which illustrates the distribution of safety scores for chosen-rejected pairs in the dataset.

```bash
cd experiments/dataset_inspection
python eval_rlhf.py
```
```bash
python plot_safety_plot_heatmap.py
```
```bash
python plot_stacked_bar.py
```

### Performance Plot Including Existing Methods

The following commands produce Figure 1 in our paper, which shows safety scores for all categories and the helpfulness win rate for all methods. We use the model trained with $\beta/\lambda=0.025$ for 200 iterations. First, copy the bias computed for this model (by `calc_diff.py` in the TSDI section) into the current directory.

```bash
cd experiments/performance_inspection
mkdir -p TSDI
cp ../TSDI/iter-200/full-beta-0.025-seed-0/diff.pt TSDI/
```

```bash
python generation.py
```

```bash
python evaluation.py
```
```bash
python plot.py
```

### For Other Plots

The following commands produce other figures referenced in our paper:

```bash
cd experiments/TSDI
python plot_line_all_iters.py
```
- `plot_line_all_iters.py` produces Figure 2(a).

```bash
python plot_scatter_all_iters.py
```
- `plot_scatter_all_iters.py` produces Figure 2(b).

```bash
python plot_scatter_per_iters.py
```
- `plot_scatter_per_iters.py` produces Figures 10 and 11.

**Note:** 
- Our scripts assume the experiments will be conducted on a machine equipped with 8 NVIDIA A100-80G GPUs. If your setup differs, you may need to adjust the accelerate configurations in `config/train`, and then modify `per_device_train_batch_size` or `gradient_accumulation_steps`.
- The `cd` commands above are written relative to the repository root, so return there before starting a new section. If a command fails, first check that the script names and paths match your directory structure.
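
When adapting the first note to a smaller machine, a common goal is to keep the effective batch size (GPUs × per-device batch × accumulation steps) unchanged. The sketch below shows that arithmetic only. The reference values of `per_device_train_batch_size` and `gradient_accumulation_steps` are hypothetical placeholders, not the repository's settings; read the real ones from your files under `config/train`. `effective_batch` is likewise an illustrative helper.

```bash
# Hypothetical reference values; replace them with the ones from config/train.
ref_gpus=8
ref_per_device=4
ref_accum=2

effective_batch() {
    # samples per optimizer step = GPUs * per-device batch * accumulation steps
    echo $(( $1 * $2 * $3 ))
}

target=$(effective_batch "$ref_gpus" "$ref_per_device" "$ref_accum")   # 64 with the values above

# On a smaller machine, keep the per-device batch and raise accumulation to compensate.
my_gpus=2
my_per_device=4
my_accum=$(( target / (my_gpus * my_per_device) ))   # 8 with the values above

echo "target effective batch size: $target"
echo "use gradient_accumulation_steps=$my_accum on $my_gpus GPUs"
```

Shell arithmetic uses integer division, so pick a GPU count and per-device batch size whose product divides the target evenly. If GPU memory is the constraint, lower `per_device_train_batch_size` and raise `gradient_accumulation_steps` by the same factor.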

## License

Apache License 2.0