# On-Policy Preference Data Generation

We provide the code to generate on-policy preference data and on-policy iterative preference data used in our experiments.

## Requirements

You will need to install the [`vllm`](https://github.com/vllm-project/vllm) package for decoding.

## On-Policy Preference Data Generation Process

1. Generate multiple responses using the language model:

```
python decode.py --model $Model_Name --data_dir $DATASET_DIR --seed $SEED
```
This generates one response per prompt for the specified seed. You need to provide a dataset containing prompts (by default, we use `HuggingFaceH4/ultrafeedback_binarized`). You can also set decoding hyperparameters by passing the corresponding arguments (by default, we sample with a temperature of `0.8`).

Note that you will need to run the above command under **multiple different** seeds (by default, we use `13, 21, 42, 79, 100`) to obtain different responses for each prompt.
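As a convenience, the per-seed runs can be scripted. The snippet below is a minimal sketch that only builds and prints one `decode.py` command per default seed, using the flags shown above. `MODEL_NAME` and `DATASET_DIR` are placeholders, and the helper name `build_decode_commands` is hypothetical, not part of this repository.

```python
import shlex

# Default seeds listed above; change these to sample a different set of responses.
SEEDS = [13, 21, 42, 79, 100]


def build_decode_commands(model, data_dir, seeds=SEEDS):
    """Return one decode.py argument list per seed (hypothetical helper)."""
    return [
        ["python", "decode.py",
         "--model", model,
         "--data_dir", data_dir,
         "--seed", str(seed)]
        for seed in seeds
    ]


# MODEL_NAME and DATASET_DIR are placeholders; substitute your own values.
commands = build_decode_commands("MODEL_NAME", "DATASET_DIR")
for cmd in commands:
    # Printing only. To execute instead, use: subprocess.run(cmd, check=True)
    print(shlex.join(cmd))
```

Running the commands sequentially is the simplest option; whether they can safely run in parallel depends on your GPU memory and on how `decode.py` names its output files.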


2. Annotate the preference labels with a reward model:

```
python generate_reward.py --reward_model $MODEL
```

This scores the generations with a reward model (by default, we use `Skywork/Skywork-Reward-Llama-3.1-8B-v0.2`) and creates the dataset by taking, for each prompt, the two highest-scoring responses as the chosen response set and the two lowest-scoring responses as the rejected response set.
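The selection rule described above can be illustrated with a short sketch. This is not the code in `generate_reward.py`; the function name `build_preference_sets`, the tie handling, and the example scores are assumptions made for illustration only.

```python
def build_preference_sets(responses, scores, k=2):
    """Split one prompt's responses into chosen (top-k) and rejected (bottom-k) sets.

    Illustrative only: the actual script may handle ties and output format differently.
    """
    if len(responses) != len(scores):
        raise ValueError("responses and scores must have the same length")
    if len(responses) < 2 * k:
        raise ValueError("need at least 2*k responses so the two sets do not overlap")

    # Rank responses from highest to lowest reward score.
    ranked = sorted(zip(scores, responses), key=lambda pair: pair[0], reverse=True)
    chosen = [response for _, response in ranked[:k]]
    rejected = [response for _, response in ranked[-k:]]
    return chosen, rejected


# Five responses for one prompt (one per seed) with made-up reward scores.
responses = ["a", "b", "c", "d", "e"]
scores = [0.1, 2.3, -1.0, 1.7, 0.4]
chosen, rejected = build_preference_sets(responses, scores)
print(chosen, rejected)
```

With five responses per prompt, the middle-ranked response ends up in neither set.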

