# Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts

This repository contains the code and released models for our paper [Multi-Preference Optimization: Generalizing DPO via Set-Level Contrasts](https://arxiv.org/abs/2412.04628). We propose Multi-Preference Optimization (MPO), a generalization of DPO that optimizes over entire sets of responses by extending the Bradley-Terry model to groupwise comparisons between chosen and rejected sets. MPO outperforms DPO and its latest variants across AlpacaEval 2, MT-Bench, and Arena-Hard benchmarks under various settings.

![MPO Image](./MPO.jpg)

### Environment
We provide an environment file (`environment.yml`) that lists the Python package versions used in our experiments. For best reproducibility, we recommend using the same package versions. Results may still vary with hardware configuration, CUDA version, and similar factors.

### Hyperparameter tuning
Hyperparameter tuning is crucial for MPO (and for preference optimization algorithms in general). The three main hyperparameters to focus on are `learning_rate`, `beta`, and `alpha`. We recommend keeping the total batch size fixed at 128.
- `learning_rate`: The most critical hyperparameter for preference optimization. A large learning rate (e.g., 1e-5) can significantly degrade performance, causing the model to produce incoherent sentences or completely repetitive responses. We recommend grid searching over 3e-7 and 5e-7 if resources allow.
- `beta`: Controls the reward scaling. MPO works with a `beta` similar to the one used in DPO. In our preprint, we used a `beta` of `0.01`.
- `alpha`: The weight applied to the reward score assigned by the external reward model.

We used the following hyperparameters for training the released models.

| Setting           | β   | Learning rate  |
|-------------------|-----|----------------|
| Mistral-Base      | 0.01| 4e-7           |
| Llama3-Base       | 0.01| 5e-7           |
| Mistral-Instruct  | 0.01| 1.5e-7         |
| Llama3-Instruct   | 0.01| 3e-7           |
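
The learning-rate grid search recommended above can be set up by writing one config copy per candidate value. The following is a minimal sketch, not part of the released code: `make_lr_configs` is a hypothetical helper, and it assumes the training YAML has a top-level `learning_rate:` key, which has not been checked against the configs in this repo.

```shell
# Hypothetical helper: write one copy of a base training config per learning rate.
# Assumption: the YAML file contains a top-level `learning_rate:` line.
# Usage: make_lr_configs <base_config.yaml> <out_dir> <lr> [<lr> ...]
make_lr_configs() {
  base_config="$1"; out_dir="$2"; shift 2
  mkdir -p "$out_dir"
  for lr in "$@"; do
    # e.g. base.yaml + 3e-7 -> <out_dir>/base_lr3e-7.yaml
    out="$out_dir/$(basename "${base_config%.yaml}")_lr${lr}.yaml"
    # Replace only the learning_rate line; every other setting is copied unchanged.
    sed -E "s/^learning_rate:.*/learning_rate: ${lr}/" "$base_config" > "$out"
    echo "$out"
  done
}

# Example (commented out because it starts full training runs):
# for cfg in $(make_lr_configs training_configs/mistral-7b-base-mpo_l0.yaml sweep 3e-7 5e-7); do
#   ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_mpo.py "$cfg"
# done
```

If the config also sets an output directory, each generated copy would likely need its own value there so that runs do not overwrite each other.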


## Install Requirements

Use the provided environment file to create the conda environment, then activate it:

```shell
conda env create -f environment.yml
```

```shell
conda activate MPO
```

You will also need Flash Attention 2 and `huggingface-hub`, which can be installed by running:

```shell
pip install flash-attn==2.5.7
pip install huggingface-hub==0.24.7
```
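
Since reproducibility depends on the pinned versions, it can help to confirm what actually got installed. The following is a small sketch under assumptions: `check_pkg_version` is a hypothetical helper (not part of this repo), and it relies on `python` in the active environment being Python 3.8 or newer so that `importlib.metadata` is available.

```shell
# Hypothetical helper: compare an installed package's version with the pinned one.
# Usage: check_pkg_version <package-name> <expected-version>
check_pkg_version() {
  pkg="$1"; expected="$2"
  # Ask Python for the installed distribution version; fail if it is missing.
  installed=$(python -c "from importlib.metadata import version; print(version('$pkg'))" 2>/dev/null) || {
    echo "$pkg: not installed"; return 1; }
  if [ "$installed" = "$expected" ]; then
    echo "$pkg: $installed (ok)"
  else
    echo "$pkg: $installed (expected $expected)"; return 1
  fi
}

# Example, using the versions pinned above:
# check_pkg_version flash-attn 2.5.7
# check_pkg_version huggingface-hub 0.24.7
```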

## Training Scripts

We provide four training config files for the four training setups reported in our paper. The training configs are set up for 8×A100 GPUs. You may need to adjust `num_processes` and `per_device_train_batch_size` based on your compute environment.

* Mistral-Base-l0:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_mpo.py training_configs/mistral-7b-base-mpo_l0.yaml
```

* Mistral-Base-l1:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_mpo.py training_configs/mistral-7b-base-mpo_l1.yaml
```

* Mistral-Base-l2:
```shell
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml scripts/run_mpo.py training_configs/mistral-7b-base-mpo_l2.yaml
```
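
When running on fewer GPUs or with a smaller per-device batch, the total batch size of 128 recommended earlier can usually be preserved through gradient accumulation. The sketch below only does the arithmetic. It assumes the total batch size is `num_processes` × `per_device_train_batch_size` × `gradient_accumulation_steps`, where `gradient_accumulation_steps` is the usual Hugging Face `TrainingArguments` name and is an assumption about what the configs in this repo expose. `grad_accum_steps` is a hypothetical helper.

```shell
# Hypothetical helper: gradient-accumulation steps needed to keep a fixed total batch size.
# Assumes: total = num_processes * per_device_train_batch_size * gradient_accumulation_steps
# Usage: grad_accum_steps <total_batch> <num_processes> <per_device_batch>
grad_accum_steps() {
  total="$1"; num_processes="$2"; per_device="$3"
  per_step=$((num_processes * per_device))
  # Refuse combinations that cannot reach the target batch size exactly.
  if [ $((total % per_step)) -ne 0 ]; then
    echo "total batch size $total is not divisible by $per_step" >&2
    return 1
  fi
  echo $((total / per_step))
}

grad_accum_steps 128 8 2   # 8 GPUs, 2 samples per GPU: prints 8
grad_accum_steps 128 4 2   # 4 GPUs, 2 samples per GPU: prints 16
```

The number of processes can be changed in the accelerate config file or passed to `accelerate launch` with its `--num_processes` flag.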

## Evaluation

We follow the official implementations for evaluation on AlpacaEval 2, Arena-Hard, and MT-Bench:

* AlpacaEval 2: Please refer to the [AlpacaEval repo](https://github.com/tatsu-lab/alpaca_eval) for evaluation.

* Arena-Hard: Please refer to the [Arena-Hard-Auto repo](https://github.com/lm-sys/arena-hard-auto) for evaluation.

* MT-Bench: Please refer to the [FastChat repo](https://github.com/lm-sys/FastChat) for evaluation.

## Citation
Please cite our paper if you find the repo helpful in your work.
