# Distributionally Robust RLHF

## Reward Models

To train the reward models, see `rew_uf_g-2b.sh`; set the desired `EPS` (the $\varepsilon$ value) via an environment variable.
Optionally set `DIST_FN=chi2o` to use the chi-squared distance function.

For example, to train a DR reward model with $\varepsilon = 0.2$ using Slurm:
```bash
sbatch --export EPS=0.2 --partition=your-partition --gres gpu:8 rew_uf_g-2b.sh
```
The trained model should be saved in `models/reward_uf_lr1e-05_dr_eps0.2_google_gemma-2b-it`.
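
To sweep several $\varepsilon$ values, the submission can be scripted. The following is a minimal sketch, not part of the repository: `rm_sbatch_cmd` is a hypothetical helper, the values `0.1` and `0.5` are illustrative, and the comma-separated `--export` list mirrors the form used in the PPO example below. It only prints the commands so they can be reviewed before submission.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): print, but do not submit, the sbatch
# command for one reward-model run. Arguments: EPS, optional DIST_FN.
# Partition and GPU count are copied from the example above.
rm_sbatch_cmd() {
  local eps="$1" dist_fn="${2:-}"
  local exports="EPS=${eps}"
  if [ -n "$dist_fn" ]; then
    exports="${exports},DIST_FN=${dist_fn}"
  fi
  echo "sbatch --export ${exports} --partition=your-partition --gres gpu:8 rew_uf_g-2b.sh"
}

# Dry run over illustrative epsilon values (assumed, not prescribed by the repo).
for eps in 0.1 0.2 0.5; do
  rm_sbatch_cmd "$eps"
done

# Same run with the chi-squared distance function.
rm_sbatch_cmd 0.2 chi2o
```

Once the printed commands look right, they can be submitted by piping the script's output to `bash`.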

Evaluation on RewardBench:
```bash
python eval_reward_bench.py --torch_dtype bfloat16 --attn_implementation flash_attention_2 --batch_size 16 --not_quantized --model google/gemma-2b-it --peft_name models/reward_uf_lr1e-05_dr_eps0.2_google_gemma-2b-it
```

## PPO

See `ppo_uf_g-2b.sh`.
For example, to run DR PPO with $\varepsilon = 0.2$, using a DR reward model also trained with $\varepsilon = 0.2$, on Slurm:
```bash
sbatch --export EPS=0.2,REW_MODEL=models/reward_uf_400k_lr1e-05_dr_eps0.2_google_gemma-2b-it --partition=your-partition --gres gpu:8 ppo_uf_g-2b.sh
```
The trained model should be saved in `models/ppo_reward_uf_400k_lr1e-05_dr_eps0.2_google_gemma-2b-it_5k_lr1e-05_kl0.02_dr_eps0.2_solvereward_google_gemma-2b-it`.
Optionally set `LOSS_TYPE=pilossgrad_pi` to run the *scaled loss* version, or `DIST_FN=chi2o` to use the chi-squared distance function.
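
Because the output directory name is long, it can help to derive it rather than type it. The sketch below assumes the name is built as `ppo_` + reward model directory name + a fixed suffix containing the PPO $\varepsilon$. That pattern is inferred from the single example above, not read from `ppo_uf_g-2b.sh`, and `ppo_out_dir` is a hypothetical helper. It may not hold when `LOSS_TYPE` or `DIST_FN` is set.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): guess the PPO output directory from
# the reward model path and the PPO epsilon. The naming pattern is an
# assumption inferred from the one example in this README.
ppo_out_dir() {
  local rew_model="$1" eps="$2"
  local rew_name
  rew_name="$(basename "$rew_model")"
  echo "models/ppo_${rew_name}_5k_lr1e-05_kl0.02_dr_eps${eps}_solvereward_google_gemma-2b-it"
}

# Check whether the expected directory exists after training.
out="$(ppo_out_dir models/reward_uf_400k_lr1e-05_dr_eps0.2_google_gemma-2b-it 0.2)"
if [ -d "$out" ]; then
  echo "found: $out"
else
  echo "not found yet: $out"
fi
```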

Evaluation on RewardBench:
```bash
python eval_rlhf_reward_bench.py --torch_dtype bfloat16 --attn_implementation flash_attention_2 --batch_size 16 --not_quantized --model models/ppo_reward_uf_400k_lr1e-05_dr_eps0.2_google_gemma-2b-it_5k_lr1e-05_kl0.02_dr_eps0.2_solvereward_google_gemma-2b-it
```

Evaluation on Unified-Feedback, HHH Alignment, and MT-Bench:
```bash
python eval_rlhf_uf_hhh_mtbench.py --model models/ppo_reward_uf_400k_lr1e-05_dr_eps0.2_google_gemma-2b-it_5k_lr1e-05_kl0.02_dr_eps0.2_solvereward_google_gemma-2b-it
```

## DPO

See `dpo_uf_g-2b.sh`.
For example, to run DR DPO with $\varepsilon = 0.2$ using Slurm:
```bash
sbatch --export EPS=0.2 --partition=your-partition --gres gpu:8 dpo_uf_g-2b.sh
```
The trained model should be saved in `models/dpo_400k_lr1e-05_dr_eps0.2_google_gemma-2b-it`.
Optionally set `DIST_FN=chi2o` to use the chi-squared distance function.

Evaluation on RewardBench:
```bash
python eval_rlhf_reward_bench.py --torch_dtype bfloat16 --attn_implementation flash_attention_2 --batch_size 16 --not_quantized --model models/dpo_400k_lr1e-05_dr_eps0.2_google_gemma-2b-it
```

Evaluation on Unified-Feedback, HHH Alignment, and MT-Bench:
```bash
python eval_rlhf_uf_hhh_mtbench.py --model models/dpo_400k_lr1e-05_dr_eps0.2_google_gemma-2b-it
```
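
The PPO and DPO policies are evaluated with the same two scripts, so both evaluations can be generated from a model directory. This is a minimal sketch with a hypothetical helper, `eval_cmds`, which prints the commands shown above for a given directory instead of running them.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the repo): print, but do not run, both policy
# evaluation commands for one model directory. Flags are copied from the
# examples above.
eval_cmds() {
  local model="$1"
  echo "python eval_rlhf_reward_bench.py --torch_dtype bfloat16 --attn_implementation flash_attention_2 --batch_size 16 --not_quantized --model ${model}"
  echo "python eval_rlhf_uf_hhh_mtbench.py --model ${model}"
}

# Dry run for the DPO model from the example above.
eval_cmds models/dpo_400k_lr1e-05_dr_eps0.2_google_gemma-2b-it
```

Pipe the output to `bash` to run the evaluations one after the other.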
