# TBPO

Code for the paper **TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching**.

## Training

### Environment setup

```bash
conda env create -f env/train_env.yml
conda activate TokenBPO-train
```

### Configuration

- `config/config.yaml`: main config file
- `config/model/*.yaml`: model config files
- `config/loss/*.yaml`: loss config files

Edit these files to match your setup before training.

### Run

Example (Q version):

```bash
bash script/train/Q_version/llama_general.sh
```

### Merge model

Edit `merge.py` and run:

```bash
python merge.py
```

## Evaluation

### Environment setup

```bash
conda env create -f env/eval_env.yml
conda activate TokenBPO-eval
```

### Run

1. Set the model name and other settings in `script/eval/runall.sh`.
2. Run:

```bash
bash script/eval/runall.sh
```

## Other evaluations

### Download data

```bash
hf download tonyshelby/processed_data --repo-type dataset --local-dir ./processed_data
```
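
The download command above relies on the `hf` command-line tool. As a minimal sketch, the check below reports whether `hf` is on your `PATH`. The helper name `check_hf_cli` is hypothetical, and the install hint assumes that `hf` is provided by a recent release of the `huggingface_hub` package.

```bash
# Hypothetical helper: report whether the `hf` command is available.
check_hf_cli() {
    if command -v hf >/dev/null 2>&1; then
        echo "hf CLI found"
    else
        # Assumption: `hf` ships with a recent huggingface_hub release.
        echo "hf CLI not found; try: pip install -U huggingface_hub" >&2
        return 1
    fi
}
```

Calling `check_hf_cli` before the download gives a clearer message than a bare "command not found".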

### Win rate (AlpacaEval)

```bash
conda create -n alpaca-eval python=3.11.11
conda activate alpaca-eval
pip install 'alpaca-eval[all]'
```

Configure the LLM judge, model, and prompts, then run:

```bash
bash winrate_eval/eval.sh
```
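
If the configured judge is an API-hosted model, a missing API key usually surfaces only after the script has started. The sketch below fails early instead. The helper `require_env` is hypothetical, and the variable name `OPENAI_API_KEY` in the usage comment assumes an OpenAI-hosted judge; substitute whichever variable your judge configuration actually reads.

```bash
# Hypothetical helper: fail early if a required environment variable
# is unset or empty. Uses eval for a POSIX-compatible indirect lookup.
require_env() {
    var_name="$1"
    eval "var_value=\${$var_name:-}"
    if [ -z "$var_value" ]; then
        echo "error: $var_name is not set" >&2
        return 1
    fi
}

# Usage (assumes an OpenAI-hosted judge):
#   require_env OPENAI_API_KEY && bash winrate_eval/eval.sh
```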

### Diversity

```bash
conda create -n diversity-metrics python=3.11.11
conda activate diversity-metrics
pip install vllm transformers sacrebleu tqdm
```

Then run generation and compute metrics as described in `diversity_metrics/notes.md`.

### MT-bench

```bash
conda create -n mtbench python=3.11.11
conda activate mtbench
cd mtbench/FastChat
pip install -e ".[model_worker,llm_judge]"
pip install vllm
```

Then start the model server and generate answers as described in `mtbench/notes.md`.
