# Reward over-optimization

## Installation

**Native Runner:** Set up a conda environment using `conda` / `mamba`:

```bash
conda env create --file conda-recipe.yaml  # or `mamba env create --file conda-recipe.yaml`
```

This automatically sets up all dependencies.
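To confirm that the environment was created, the output of `conda env list` can be checked for its name. The snippet below is a minimal sketch: `has_conda_env` is a hypothetical helper, not part of this repository, and the environment name `roo` is assumed from the `conda activate roo` command in the Training section.

```bash
# Hypothetical helper: succeeds if the environment name given as $1
# appears in the first column of `conda env list` output read from stdin.
has_conda_env() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Only query conda when it is actually on PATH.
if command -v conda >/dev/null 2>&1; then
    if conda env list | has_conda_env roo; then
        echo "environment 'roo' is available"
    else
        echo "environment 'roo' not found; re-run the create command" >&2
    fi
else
    echo "conda is not on PATH" >&2
fi
```

If the recipe names the environment differently, pass that name instead of `roo`.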

## Training

0. Follow the instructions in the [Installation](#installation) section to set up the training environment, then activate it and export your W&B API key:

   ```bash
   conda activate roo
   export WANDB_API_KEY="..."  # your W&B API key here
   ```

1. ScoreLM model

   > Note: You may need to train a gold reward model first, and then generate the re-annotated data used to train the ScoreLM model.

   ```bash
   bash scripts/scorelm.sh \
       --model_name_or_path PKU-Alignment/alpaca-7b-reproduced \
       --output_dir output/scorelm
   ```

2. BSPO

   > Note: You may need to train a gold reward model first.

   ```bash
   bash scripts/ppo_supported_value.sh \
       --actor_model_name_or_path PKU-Alignment/alpaca-7b-reproduced \
       --reward_model_name_or_path output/scorelm \
       --gold_model_name_or_path <gold_model_name_or_path> \
       --output_dir output/bspo
   ```

3. Baselines (Optional)

   > Note: You may need to train a gold reward model first.

   1. PPO

      ```bash
      bash scripts/ppo.sh \
          --actor_model_name_or_path PKU-Alignment/alpaca-7b-reproduced \
          --reward_model_name_or_path output/scorelm \
          --gold_model_name_or_path <gold_model_name_or_path> \
          --output_dir output/ppo
      ```

   2. KL penalty

      ```bash
      bash scripts/kl.sh \
          --actor_model_name_or_path PKU-Alignment/alpaca-7b-reproduced \
          --reward_model_name_or_path output/scorelm \
          --gold_model_name_or_path <gold_model_name_or_path> \
          --output_dir output/ppo-kl
      ```

   3. CPPO

      ```bash
      bash scripts/ConstrainedPPO.sh \
          --actor_model_name_or_path PKU-Alignment/alpaca-7b-reproduced \
          --cost_model_name_or_path output/scorelm \
          --gold_model_name_or_path <gold_model_name_or_path> \
          --output_dir output/cppo
      ```

   4. ENS

      ```bash
      bash scripts/uwo.sh \
          --actor_model_name_or_path PKU-Alignment/alpaca-7b-reproduced \
          --gold_model_name_or_path <gold_model_name_or_path> \
          --output_dir output/uwo
      ```

      ```bash
      bash scripts/wco.sh \
          --actor_model_name_or_path PKU-Alignment/alpaca-7b-reproduced \
          --gold_model_name_or_path <gold_model_name_or_path> \
          --output_dir output/wco
      ```

      > Note: You may need to train four reward models or ScoreLM models first to run these scripts.
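
Several of the steps above depend on earlier outputs (for example, BSPO reads the ScoreLM checkpoint from `output/scorelm`) and on `WANDB_API_KEY` being exported. A quick check before launching a long run can catch a missing prerequisite early. The following is a minimal sketch: `preflight` is a hypothetical helper, not part of this repository, and it assumes that the paths passed to it are local checkpoint directories rather than model hub identifiers.

```bash
# Hypothetical pre-flight helper: returns non-zero if WANDB_API_KEY is unset
# or if any directory passed as an argument does not exist.
preflight() {
    local status=0
    local dir

    # The training steps export a W&B API key in step 0.
    if [ -z "${WANDB_API_KEY:-}" ]; then
        echo "WANDB_API_KEY is not set" >&2
        status=1
    fi

    # Assumption: every argument is a local checkpoint directory.
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "missing checkpoint directory: $dir" >&2
            status=1
        fi
    done

    return "$status"
}

# Example: BSPO (step 2) reads the ScoreLM checkpoint written in step 1.
if preflight output/scorelm; then
    echo "ready to launch"
else
    echo "fix the problems above before launching" >&2
fi
```

If the gold reward model is stored locally, its directory can be passed as an additional argument.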
