In this work, we introduce **Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance** (referred to below as DIR), a framework to mitigate format biases (e.g., response length, list formatting, bold text) in reward models (RMs) for large language models.

Our approach minimizes the mutual information between the *difference* in response representations and the *relative* bias attributes. This is achieved by training a variational network adversarially against the RM's encoder, which encourages the encoder to learn representations that are invariant to spurious format correlations while retaining true preference signals.
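
As a rough illustration, the objective described above can be written as follows. The notation is an assumption made for this README and is not necessarily the paper's exact formulation: $f_\theta$ stands for the RM encoder, $b(\cdot)$ for a function that extracts bias attributes (such as length) from a response, $(y_w, y_l)$ for the chosen and rejected responses to a prompt $x$, $\mathcal{L}_{\mathrm{pref}}$ for the usual pairwise preference loss, $\hat{I}_\phi$ for the mutual-information estimate produced by the variational network with parameters $\phi$, and $\lambda$ for a trade-off weight.

```latex
\Delta h = f_\theta(x, y_w) - f_\theta(x, y_l),
\qquad
\Delta b = b(y_w) - b(y_l)

\min_{\theta} \;
  \mathcal{L}_{\mathrm{pref}}(\theta)
  + \lambda \, \hat{I}_{\phi}\!\left(\Delta h ; \Delta b\right)
```

In this sketch, the variational network is fit to predict $\Delta b$ from $\Delta h$, while the encoder is updated to make that prediction uninformative, which is the adversarial interplay the paragraph above refers to.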

## ⚙️ 1. Setup and Installation

First, we recommend creating a Conda virtual environment and installing the required dependencies.

```bash
# Create and activate the conda environment
conda create -n dir python=3.9
conda activate dir

# Install all other dependencies
pip install -r requirements.txt
```

## 📥 2. Data and Model Preparation

We provide convenient scripts to download all the necessary datasets (e.g., UltraFeedback, bias evaluation sets) and the base models (e.g., Llama-3-8B-Instruct) used in our experiments.

Run the following commands from the project's root directory:

```bash
# Navigate to the scripts directory
cd scripts

# Download all required datasets
bash auto_download_data.sh

# Download the base language models
bash auto_download_model.sh
```

After the scripts complete, your data and models will be organized in the designated directories.

## 🚀 3. Training and Evaluation Pipeline

The full experimental pipeline consists of three main stages: training the debiased reward model, aligning a policy model using PPO, and evaluating the final policy.

### Step 3.1: Train the Debiased Reward Model (DIR)

To train our debiased reward model using the DIR framework, run the `train_debias_rm.sh` script. This script orchestrates the training process defined in `reward_models/run_debias_reward_models_train.py`.

```bash
# Make sure you are in the scripts/ directory
bash train_debias_rm.sh
```

The training logs and final RM checkpoints will be saved to the output directory specified within the script (e.g., `../exp/debiased_rm`).

### Step 3.2: Align a Policy with PPO

Once the debiased RM is trained, we use it to provide rewards for aligning a policy model with Proximal Policy Optimization (PPO).

**Important:** Before running, you must edit `ms_ppo_script.sh` (in the `scripts/` directory) and update the `REWARD_MODEL_PATH` variable to point to the checkpoint of the debiased RM you trained in the previous step. Please also make sure you have cloned MS-Swift successfully.
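
If you prefer not to edit the file by hand, the variable can also be set from the command line. The following is a minimal sketch, assuming `ms_ppo_script.sh` assigns the variable on a line of its own that starts with `REWARD_MODEL_PATH=`. The `set_script_var` helper is hypothetical and not part of this repository.

```bash
# Hypothetical helper (not shipped with this repo): rewrite a NAME=... assignment
# in a shell script in place. Assumes the assignment starts at the beginning of a line.
set_script_var() {
  local script="$1" name="$2" value="$3"

  # Fail loudly if the assignment is missing instead of silently changing nothing.
  if ! grep -q "^${name}=" "$script"; then
    echo "error: ${name}= not found in ${script}" >&2
    return 1
  fi

  # '|' is used as the sed delimiter so that slashes in paths need no escaping.
  # -i.bak edits in place and keeps a backup copy named <script>.bak.
  sed -i.bak "s|^${name}=.*|${name}=\"${value}\"|" "$script"
}

# Example call, using the checkpoint path from this README:
# set_script_var ms_ppo_script.sh REWARD_MODEL_PATH "../exp/debiased_rm/checkpoint-final"
```

The helper leaves the original file as `ms_ppo_script.sh.bak`, so the change is easy to revert. Values containing `|` or `&` would need escaping before being passed in.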

```bash
# Example modification inside ms_ppo_script.sh:
# REWARD_MODEL_PATH="../exp/debiased_rm/checkpoint-final"

# Run the PPO training script
bash ms_ppo_script.sh
```

This will train a policy model and save the checkpoints to the specified output directory.

### Step 3.3: Evaluate the Final Aligned Policy

Finally, we evaluate the performance of the PPO-aligned policy model on various benchmarks using the `evalscope` evaluation tool.

**Important:** Before running, you must edit `rm_eval/evalscope_evaluation_script.sh` and update the `MODEL_PATH` variable to point to the PPO-aligned model checkpoint from Step 3.2. Please also make sure you have cloned EvalScope successfully.
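
A wrong or empty checkpoint path usually only surfaces once the evaluation starts loading the model, so a quick pre-flight check can save time. The snippet below is a sketch only: `require_nonempty_dir` is a hypothetical helper, and the path in the example call is a placeholder for wherever your PPO checkpoint was saved.

```bash
# Hypothetical pre-flight check (not shipped with this repo): succeed only if
# the given path is an existing directory that contains at least one entry.
require_nonempty_dir() {
  local dir="$1"

  if [ ! -d "$dir" ]; then
    echo "error: directory not found: $dir" >&2
    return 1
  fi

  # ls -A also lists dotfiles; empty output means the directory is empty.
  if [ -z "$(ls -A "$dir")" ]; then
    echo "error: directory is empty: $dir" >&2
    return 1
  fi
}

# Example call (placeholder path, replace with your own checkpoint directory):
# require_nonempty_dir "/path/to/ppo/checkpoint" && bash rm_eval/evalscope_evaluation_script.sh
```

The check only confirms that the directory exists and is not empty. It does not validate that the contents are a loadable model checkpoint.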

```bash
# From the root directory, run the evaluation script
bash rm_eval/evalscope_evaluation_script.sh
```

The script will generate responses for the benchmark prompts and compute the final evaluation scores, saving the results to the specified output directory.

## 📁 Repository Structure

```
.
├── deepspeed_configs/     # DeepSpeed configuration files
├── reward_models/         # Core logic for training all reward models
│   ├── debias_trainer.py  # Trainer implementing the DIR framework
│   └── run_debias_reward_models_train.py # Main script to launch DIR training
├── rm_eval/               # Scripts for evaluating reward and policy models
│   ├── eval_biasbench.py  # Evaluate format bias
│   └── evalscope_evaluation_script.sh # Evaluate policy performance
├── scripts/               # Main workflow orchestration scripts
│   ├── auto_download_data.sh
│   ├── auto_download_model.sh
│   ├── train_debias_rm.sh # Use this to train our model
│   └── ms_ppo_script.sh   # Use this for PPO alignment
├── requirements.txt       # Project dependencies
└── README.md              # This file
```

