# Primal-Dual DPO
This code is the official implementation of the paper "Primal-Dual Direct Preference Optimization for Constrained LLM Alignment". 

Our code builds on the open-source code of two prior works:

> Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong
> Yang. Safe RLHF: Safe reinforcement learning from human feedback. In International Conference
> on Learning Representations, 2024.

> Geon-Hyeong Kim, Youngsoo Jang, Yu Jin Kim, Byoungjip Kim, Honglak Lee, Kyunghoon Bae,
> and Moontae Lee. SafeDPO: A simple approach to direct preference optimization with enhanced
> safety. arXiv preprint arXiv:2505.20065, 2025.

In the code, every path must be either a local path or a Hugging Face model path.

## Environment

```
conda env create -f conda-recipe.yaml 
conda activate pddpo
```

## SFT

We use the model `PKU-Alignment/alpaca-7b-reproduced` from Hugging Face as the SFT model.

## Primal-Dual DPO

First, run the following command to train a model using the standard DPO algorithm:

```
bash scripts/dpo.sh --model_name_or_path PKU-Alignment/alpaca-7b-reproduced --output_dir [the model output path]
```

Then, run the following command to train the Primal-Dual DPO model, starting from the DPO model trained above (saved as `dpo_naive_42`):

```
bash scripts/primal_dual_dpo.sh --model_name_or_path [the model output path]/dpo_naive_42 --epoch 5 --lag 5 --output_dir [the model output path]
```
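
The two training stages above can be chained in one place. The following is a minimal sketch, not part of the repository: `run` and `train_pipeline` are hypothetical helper names, and `DRY_RUN` is an illustrative variable that only prints the commands instead of executing them. The script names and flags are the ones shown above.

```shell
# Hypothetical helper: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: stage 1 (standard DPO), then stage 2 (Primal-Dual DPO).
# $1 is the model output path used by both stages.
train_pipeline() {
    out_dir="$1"
    run bash scripts/dpo.sh \
        --model_name_or_path PKU-Alignment/alpaca-7b-reproduced \
        --output_dir "$out_dir" || return 1
    # Stage 2 starts from the stage-1 checkpoint under dpo_naive_42.
    run bash scripts/primal_dual_dpo.sh \
        --model_name_or_path "$out_dir/dpo_naive_42" \
        --epoch 5 --lag 5 \
        --output_dir "$out_dir"
}
```

For example, `DRY_RUN=1; train_pipeline ./output` prints both commands so the paths can be checked before starting a long training run.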

## Evaluation

### Model-based Evaluation

First, run the following command to generate responses for the model that you want to evaluate:

```
bash scripts/test/generation.sh --model_name_or_path [the path to the model you want to evaluate] --output_dir [the model-based evaluation output path]
```

Then, run the following command to analyze the model:

```
bash scripts/test/arena-evaluation2.sh --response_dir [the model-based evaluation output path]
```

To compare the models pairwise, run the following command to perform a tournament analysis (this requires the response outputs of at least two models in [the model-based evaluation output path]):

```
python analysis/arena_analysis.py --response_dir [the model-based evaluation output path]/arena_tournament
```

### GPT-4 Evaluation

Write your OpenAI API key to the following file:
```
[~/.secrets/openai_api_key.env]
```
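
Since the key is a secret, it is worth creating the file with restrictive permissions. A minimal sketch follows; `KEY_FILE` is an illustrative variable, not something the scripts read. This README does not specify the expected contents of the file (for example a raw key versus a `NAME=value` line), so check the evaluation script before filling it in.

```shell
# Location given above; override KEY_FILE to use a different (hypothetical) path.
KEY_FILE="${KEY_FILE:-$HOME/.secrets/openai_api_key.env}"

# Create the parent directory and an empty key file if they do not exist yet.
mkdir -p "$(dirname "$KEY_FILE")"
touch "$KEY_FILE"

# Restrict the file so that only your user can read or write it.
chmod 600 "$KEY_FILE"
```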

#### Harmlessness Evaluation

First, copy all the generated response outputs (the `responses.jsonl` files) from the model-based evaluation into a common folder:
`[the GPT harmlessness evaluation output path]/generated_responses`
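
Because every output file has the same name, copying them into one folder needs some care to avoid overwriting. The sketch below is one way to do it, under a loud assumption: each `responses.jsonl` sits in a folder named after its model, and keeping one subfolder per model is acceptable to the evaluation script. Neither is confirmed by this README, and `collect_responses` is a hypothetical helper, so adjust it to the layout the script actually expects.

```shell
# Hypothetical helper: copy every responses.jsonl under $1 into $2,
# keeping one subfolder per model so files with the same name do not collide.
collect_responses() {
    src_root="$1"   # the model-based evaluation output path
    dst_root="$2"   # .../generated_responses
    mkdir -p "$dst_root"
    find "$src_root" -type f -name responses.jsonl | while IFS= read -r src; do
        # Assumption: the parent folder name identifies the model.
        model_name="$(basename "$(dirname "$src")")"
        mkdir -p "$dst_root/$model_name"
        cp "$src" "$dst_root/$model_name/responses.jsonl"
    done
}
```

Call it as `collect_responses <model-based evaluation output path> <GPT harmlessness evaluation output path>/generated_responses`, keeping the destination outside the source tree.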

Then, run the following command to compare the models pairwise on harmlessness using GPT-4:

```
bash scripts/test/gpt4-evaluation.sh --response_dir [the GPT harmlessness evaluation output path]/generated_responses --prompt harmlessness
```

#### Helpfulness Evaluation

First, run the following command to generate responses to the helpfulness-based questions:

```
bash scripts/test/generation.sh --model_name_or_path [the path to the model you want to evaluate] --dataset_path data/helpful_problem.json --output_dir [the GPT helpfulness evaluation output path]
```

Then, run the following command to compare the models pairwise on helpfulness using GPT-4:

```
bash scripts/test/gpt4-evaluation.sh --response_dir [the GPT helpfulness evaluation output path]/generated_responses --prompt helpfulness
```
