# Preference-Based Reward Repair (PBRR)

This repository contains the implementation of **Preference-Based Reward Repair (PBRR)**, a novel approach for repairing reward functions to mitigate reward hacking in reinforcement learning agents. This code accompanies our ICLR submission.

## Repository Structure

```
ICLR_PBRR/
├── learn_reward/              # Core learning algorithms
│   ├── pbrr.py               # Main PBRR implementation
│   ├── repair_proxy_with_rrm.py  # RRM baseline
│   ├── repair_proxy_with_pi_ref_constraint.py  # RRM + State-Constraint baseline
│   ├── learn_reward_ab_initio_methods.py  # Online-RLHF and Online-RLHF + State-Constraint baselines
│   └── train_policy.py       # Policy training
├── reward_modeling/           # Reward model components
│   ├── reward_wrapper_pbrr.py   # PBRR reward learning methods
│   ├── reward_wrapper_ensemble.py  # Default reward learning methods
│   └── replay_buffer.py      # Experience replay utilities
├── occupancy_measures/        # Code from Laidlaw et al., 2024
├── utils/                     # Utility functions
├── data/                      # Experimental data and checkpoints
│   └── base_policy_checkpoints/  # Reference policy checkpoints
└── {env_name}_run_scripts/    # Environment-specific run scripts
    ├── ai_safety_run_scripts/ # AI Safety Gridworld scripts
    ├── pandemic_run_scripts/  # Pandemic Mitigation scripts
    ├── glucose_run_scripts/   # Glucose Monitoring scripts
    └── traffic_run_scripts/   # Traffic Control scripts
```

## Quick Start

### Prerequisites

- Python 3.8+
- PyTorch
- Install dependencies: `pip install -r requirements.txt`

### Running Experiments

Our main results can be reproduced using the provided run scripts. For each environment, navigate to the corresponding `{env_name}_run_scripts/` directory:

```bash
cd {env_name}_run_scripts/

# Run PBRR (our method)
./run_pbrr.sh

# Run baseline methods
./run_rrm.sh                    # Residual Reward Modeling (RRM) baseline
./run_online_rlhf.sh           # Online RLHF baseline

# Run constrained variants
./run_rrm_state_constraint.sh # Residual Reward Modeling (RRM) + State-Constraint baseline
./run_online_rlhf_state_constraint.sh # Online RLHF + State-Constraint baseline
```
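
Since the note under Configuration says reward learning scripts should not be run in parallel, one option is to launch the scripts one after another from a small Python driver. The sketch below is an illustration only: the `run_all` helper and its `dry_run` flag are hypothetical and not part of the repository; only the script names come from the listing above.

```python
import subprocess
from pathlib import Path

# Script names taken from the run-script listing; the ordering is arbitrary.
RUN_SCRIPTS = [
    "run_pbrr.sh",
    "run_rrm.sh",
    "run_online_rlhf.sh",
    "run_rrm_state_constraint.sh",
    "run_online_rlhf_state_constraint.sh",
]


def run_all(script_dir, scripts=RUN_SCRIPTS, dry_run=False):
    """Run each script in `script_dir` sequentially; return the commands used.

    Hypothetical helper, not part of the repository.
    """
    script_dir = Path(script_dir)
    commands = []
    for name in scripts:
        cmd = ["bash", name]
        commands.append(cmd)
        if not dry_run:
            # check=True stops the sequence at the first failing script.
            subprocess.run(cmd, cwd=script_dir, check=True)
    return commands
```

For example, `run_all("pandemic_run_scripts")` would run the five scripts in order from that directory, while `run_all("pandemic_run_scripts", dry_run=True)` only returns the commands without executing them.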

### Configuration

Each run script contains environment-specific configurations including:
- Hyperparameters for learning algorithms
- Output directories for results and checkpoints

Note: each run script logs the mean return with respect to the ground-truth reward function under the policy ID `current/mean_return`.

Note: because reward functions are saved and loaded during training via `learn_reward.unique_id_state`, multiple reward learning scripts cannot be run in parallel from the same `learn_reward` folder. In practice, we duplicate the `learn_reward` folder and change the `learn_reward.unique_id_state.state["unique_id"]` value for each run.
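
The folder duplication described in the note can be scripted. The following is a minimal sketch, assuming `learn_reward` is a plain directory that can be copied as-is; the `duplicate_package` helper and the `learn_reward_<run_tag>` naming are hypothetical. It only copies the folder: the `unique_id` value in the copy still has to be changed by hand, and any imports or run scripts that refer to `learn_reward` by name may need to point at the copy.

```python
import shutil
from pathlib import Path


def duplicate_package(repo_root, run_tag, package="learn_reward"):
    """Copy `package` to `<package>_<run_tag>` inside `repo_root`.

    Hypothetical helper; refuses to overwrite an existing copy.
    """
    repo_root = Path(repo_root)
    src = repo_root / package
    dst = repo_root / f"{package}_{run_tag}"
    if not src.is_dir():
        raise FileNotFoundError(f"{src} is not a directory")
    if dst.exists():
        raise FileExistsError(f"{dst} already exists")
    # Skip bytecode caches so the copy starts clean.
    shutil.copytree(src, dst, ignore=shutil.ignore_patterns("__pycache__"))
    return dst
```

Calling `duplicate_package(".", "run1")` from the repository root would create `learn_reward_run1/` next to the original folder.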