# Sample Efficient Preference Alignment in LLMs via Active Exploration

## What is this repo?

This repo contains an implementation of our methods, built on the code released with the [DPO paper](https://github.com/eric-mitchell/direct-preference-optimization).

The DPO pipeline has two stages:

1. Run supervised fine-tuning (SFT) on the dataset(s) of interest.
2. Run active learning for preference data.

Below are examples of how to use the code to run SFT, AE-Borda-DPO, and AE-DPO.

## Running SFT

```bash
python -u train.py model=gpt2-large datasets=[hh] loss=sft exp_name=hh_sft_gpt2-large gradient_accumulation_steps=2 batch_size=32 eval_batch_size=16 trainer=FSDPTrainer sample_during_eval=false
```

## Running active learning with AE-Borda-DPO

```bash
python -u train.py model=gpt2-large datasets=[hh] loss=dpo loss.beta=0.1 model.archive=sft_policy.pt exp_name=hh_borda_gpt2-large gradient_accumulation_steps=2 batch_size=32 eval_batch_size=16 trainer=BasicTrainer sample_during_eval=true pretrain=false online=true max_train_examples=30000 have_llm_dropout=true selection_strategy=borda
```
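
For orientation, `loss=dpo` and `loss.beta=0.1` in the command above refer to the DPO objective from the DPO paper. The following is a minimal sketch of the per-example loss in plain Python, assuming the inputs are summed token log-probabilities of the chosen and rejected responses under the policy and the frozen reference (SFT) model. The function name `dpo_loss` and its argument names are hypothetical and are not taken from this repo's code.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (policy margin - reference margin))."""
    # How much more the policy prefers the chosen response over the rejected one
    policy_margin = policy_chosen_logp - policy_rejected_logp
    # The same preference margin under the frozen reference model
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(x)) == log(1 + exp(-x)), written in a numerically stable form
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

A larger `beta` penalizes the policy more strongly for a given shortfall relative to the reference margin; when the policy and reference margins are equal, the loss is `log(2)` regardless of `beta`.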

## Running active learning with AE-DPO

```bash
python -u train.py model=gpt2-large datasets=[hh] loss=dpo loss.beta=0.1 model.archive=sft_policy exp_name=hh_ae_gpt2-large gradient_accumulation_steps=2 batch_size=32 eval_batch_size=16 trainer=BasicTrainer sample_during_eval=true pretrain=false active=true have_llm_dropout=true
```

Note: The contributed datasets, Jeopardy! and Haikus, are not included in the supplementary materials because of the size limit enforced by the conference.