# $\varepsilon$-Direct Preference Optimization ($\varepsilon$-DPO)

This is the supplementary material for **"KL Penalty Control via Perturbation for Direct Preference Optimization"**. It reproduces the main experimental results for `Mistral-Instruct` and `Llama-3-Instruct`, which are reported in Table 1 and Table 2.

## Requirements

Set up your environment with `Python 3.10`, then install the dependencies:

```
pip install -r requirements.txt
```

If you want to use `FlashAttention 2` with the included training script, you also need to install `flash-attn`:

```
pip install flash-attn --no-build-isolation
```
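Before enabling FlashAttention 2, it can help to confirm that the package is importable in the active environment. The following is a minimal sketch, assuming the package installs under the module name `flash_attn`; the helper name `flash_attn_available` is hypothetical and not part of this repository.

```python
import importlib.util


def flash_attn_available() -> bool:
    """Return True if the `flash_attn` module can be found in this environment.

    Hypothetical helper for illustration; it is not part of this repository.
    """
    # find_spec returns None when the top-level module is not installed.
    return importlib.util.find_spec("flash_attn") is not None


if __name__ == "__main__":
    if flash_attn_available():
        print("flash-attn found: FlashAttention 2 can be enabled.")
    else:
        print("flash-attn not found: keep the default attention implementation.")
```

This only checks that the module can be located; it does not verify that the installed build matches your GPU or CUDA version.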

## Training

The included training script can be launched as follows:

```
# Mistral-Instruct
accelerate launch --config_file=configs/accelerate.yaml train.py --config=configs/mistral_instruct.yaml

# Llama-3-Instruct
accelerate launch --config_file=configs/accelerate.yaml train.py --config=configs/llama3_instruct.yaml
```

If you want to enable FlashAttention 2, uncomment the `attn_implementation: "flash_attention_2"` line in `configs/mistral_instruct.yaml` and `configs/llama3_instruct.yaml`.

## Evaluation

We evaluate the models produced by the provided training script on AlpacaEval 2, Arena-Hard, and MT-Bench. AlpacaEval 2 and Arena-Hard let us specify the sampling configuration used for evaluation, and we strictly follow the sampling configuration used by SimPO. You can find these sampling configurations in `evals/alapcaeval2` and `evals/arenahard`. The results on each benchmark should match Table 1 and Table 2, which we reproduce below.

|Model|AlpacaEval 2 (LC / WR)|Arena-Hard (WR)|MT-Bench (Score)|
|:---|:---:|:---:|:---:|
|Mistral-Instruct|35.6 / 29.6|17.2|7.8|
|Llama-3-Instruct|46.4 / 44.9|36.7|8.0|
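
To check a reproduction run against the table above, a small comparison helper may be convenient. The following is only a sketch: the metric keys, the function name `compare_to_expected`, and the default `tolerance` are all hypothetical choices for illustration, not part of this repository or thresholds from the paper. The `measured` values in the usage example are made-up placeholders, not actual results.

```python
# Expected scores copied from the table above.
EXPECTED = {
    "Mistral-Instruct": {
        "alpacaeval2_lc": 35.6,
        "alpacaeval2_wr": 29.6,
        "arena_hard_wr": 17.2,
        "mt_bench": 7.8,
    },
    "Llama-3-Instruct": {
        "alpacaeval2_lc": 46.4,
        "alpacaeval2_wr": 44.9,
        "arena_hard_wr": 36.7,
        "mt_bench": 8.0,
    },
}


def compare_to_expected(model, measured, tolerance=1.0):
    """Return {metric: measured - expected} for metrics off by more than `tolerance`.

    `tolerance` is an arbitrary illustrative value (assumption), since
    judge-based benchmarks vary somewhat from run to run.
    """
    expected = EXPECTED[model]
    return {
        metric: round(measured[metric] - expected[metric], 2)
        for metric in expected
        if metric in measured
        and abs(measured[metric] - expected[metric]) > tolerance
    }


if __name__ == "__main__":
    # Placeholder numbers for illustration only; substitute your own scores.
    measured = {
        "alpacaeval2_lc": 35.1,
        "alpacaeval2_wr": 27.9,
        "arena_hard_wr": 17.0,
        "mt_bench": 7.8,
    }
    print(compare_to_expected("Mistral-Instruct", measured))
```

An empty dictionary means every supplied metric falls within the chosen tolerance of the expected value.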