# SafeDPO
This repository is the official implementation of SafeDPO: A Simple Approach to Direct Preference Optimization with Enhanced Safety.

## Installation Guide
Create the conda environment from `conda-recipe.yaml` and activate it:

```
conda env create -f conda-recipe.yaml 
conda activate safedpo
```

Depending on your CUDA version, you may need to modify the `conda-recipe.yaml` file (it is currently configured to work with CUDA version 12.4).
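If you are unsure which CUDA version your machine supports, a small helper along these lines may help. It is only a sketch: `cuda_version_from_smi` is a hypothetical name, not part of this repository, and it assumes the usual `nvidia-smi` header line containing `CUDA Version: X.Y`. Note that `nvidia-smi` reports the highest CUDA version the installed driver supports, which is not necessarily the toolkit version.

```shell
# Hypothetical helper (not part of this repository):
# extract e.g. "12.4" from nvidia-smi header text read on stdin.
cuda_version_from_smi() {
    sed -n 's/.*CUDA Version: \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Only query the driver if nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "Driver supports CUDA $(nvidia-smi | cuda_version_from_smi)"
else
    echo "nvidia-smi not found; skipping CUDA check" >&2
fi
```

If the reported version is lower than 12.4, that is a sign the recipe probably needs adjusting.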

## How to Run
### SFT
Before running the algorithms, fine-tune a reference model through supervised training:
```
bash scripts/sft.sh
```
### SafeDPO
```
bash scripts/safedpo.sh
```
### DPO
**DPO-HELPFUL**
```
bash scripts/dpo.sh
```
**DPO-HARMLESS**
```
bash scripts/dpo.sh --dpo_option=cost
```

**DPO-SAFEBETTER**
```
bash scripts/dpo.sh --dpo_option=safebetter
```
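The three DPO variants above differ only in the `--dpo_option` flag. As a rough sketch, the mapping can be captured in a small helper; `dpo_flag` is a hypothetical name introduced here for illustration and is not part of this repository.

```shell
# Hypothetical helper (not part of this repository):
# map a DPO variant name to the flag used in the commands above.
dpo_flag() {
    case "$1" in
        helpful)    echo "" ;;                          # default, no flag
        harmless)   echo "--dpo_option=cost" ;;
        safebetter) echo "--dpo_option=safebetter" ;;
        *)          echo "unknown variant: $1" >&2; return 1 ;;
    esac
}

# Usage (from the repository root), e.g. for DPO-HARMLESS:
# bash scripts/dpo.sh $(dpo_flag harmless)
```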
### PPO-$\lambda$
```
bash scripts/reward-model.sh
bash scripts/cost-model.sh
bash scripts/ppo-lag.sh
```
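The listing order suggests that the reward and cost models are trained before `ppo-lag.sh` uses them; that dependency is an assumption inferred from the order, not something the scripts are confirmed to enforce. Under that assumption, a minimal sketch for running the stages in sequence and stopping at the first failure could look like this (`run_stages` is a hypothetical helper, not part of this repository):

```shell
# Hypothetical helper (not part of this repository):
# run each script in order and stop at the first one that fails.
run_stages() {
    for stage in "$@"; do
        echo "==> $stage"
        bash "$stage" || { echo "stage failed: $stage" >&2; return 1; }
    done
}

# Usage (from the repository root):
# run_stages scripts/reward-model.sh scripts/cost-model.sh scripts/ppo-lag.sh
```

Stopping early avoids launching the PPO step when one of the earlier training runs did not finish.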
### Evaluation
In our rebuttal, we use [PKU-Alignment/beaver-7b-unified-reward](https://huggingface.co/PKU-Alignment/beaver-7b-unified-reward) and [PKU-Alignment/beaver-7b-unified-cost](https://huggingface.co/PKU-Alignment/beaver-7b-unified-cost) to evaluate the models:
```
bash scripts/arena-evaluation.sh
```
For a customized evaluation, you can specify the two models to compare as well as the reward and cost models used to score them:
```
bash scripts/arena-evaluation.sh --red_corner_model_name_or_path=[MODEL_1_PATH] --blue_corner_model_name_or_path=[MODEL_2_PATH] --reward_model_name_or_path=[REWARD_MODEL_PATH] --cost_model_name_or_path=[COST_MODEL_PATH]
```
Each path should be either a local path or a Hugging Face model ID.
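Because the customized command is long, it can be convenient to assemble it from positional arguments and inspect it before running. The following is a sketch only: `arena_eval_cmd` is a hypothetical helper, and the two model paths in the usage line are placeholders rather than paths this repository is known to produce.

```shell
# Hypothetical helper (not part of this repository):
# print the arena evaluation command without running it.
# $1: red-corner model, $2: blue-corner model, $3: reward model, $4: cost model
arena_eval_cmd() {
    echo bash scripts/arena-evaluation.sh \
        "--red_corner_model_name_or_path=$1" \
        "--blue_corner_model_name_or_path=$2" \
        "--reward_model_name_or_path=$3" \
        "--cost_model_name_or_path=$4"
}

# Usage (placeholder model paths; replace with your own):
# arena_eval_cmd path/to/model-1 path/to/model-2 \
#     PKU-Alignment/beaver-7b-unified-reward PKU-Alignment/beaver-7b-unified-cost
```

Once the printed command looks right, run it from the repository root.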