# Adaptive Uncertainty-Based Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) is a popular technique to align large language models (LLMs) to human preferences.
It requires learning a reward model that predicts scalar values given a generated text sequence, acting as a proxy for human preference scores.
A central problem of RLHF is *reward hacking*, i.e., overoptimization: LLMs can easily exploit the reward model by generating text that receives high scores but no longer aligns with human preferences.
We address this problem by proposing a novel objective which adapts the tradeoff between reward model score and regularization based on reward uncertainty.
We hypothesize that when the reward model uncertainty is low, RLHF should take larger optimization steps by lowering the regularization coefficient.
On the other hand, when the uncertainty is high, optimization should slow down by staying closer to the original model.
We present a novel re-formulation of the RLHF objective and derive our approach from its generalization to account for reward model variance.
We demonstrate that our uncertainty-aware RLHF objective mitigates overoptimization and outperforms vanilla RLHF by 50% on a standard summarization task.
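
To make the intuition concrete, here is a minimal sketch of an uncertainty-scaled penalty. It is not the objective implemented in this repository: the function name `adaptive_reward`, the use of ensemble variance as the uncertainty estimate, and the linear scaling rule `base_beta * (1 + variance)` are all assumptions chosen for illustration.

```python
import statistics


def adaptive_reward(rm_scores, kl, base_beta=0.1):
    """Combine reward scores with an uncertainty-scaled KL penalty.

    rm_scores: scores for one sample from several reward models
               (ASSUMPTION: uncertainty is estimated from an ensemble).
    kl:        KL estimate between the policy and the original model.
    base_beta: baseline regularization coefficient (illustrative value).
    """
    mean_score = statistics.fmean(rm_scores)
    variance = statistics.pvariance(rm_scores)

    # ASSUMPTION: a simple linear rule. Low variance keeps beta near
    # base_beta; high variance raises beta, which keeps the policy
    # closer to the original model.
    beta = base_beta * (1.0 + variance)

    return mean_score - beta * kl, beta


# Reward models agree: variance is 0, so the penalty stays at base_beta.
confident_reward, confident_beta = adaptive_reward([1.0, 1.0, 1.0], kl=2.0)

# Reward models disagree: same mean score, but a larger penalty.
uncertain_reward, uncertain_beta = adaptive_reward([0.0, 2.0], kl=2.0)

print(confident_beta, uncertain_beta)
```

With the same mean score and the same KL estimate, the sample on which the reward models disagree ends up with a lower penalized reward, which is the behavior the hypothesis above asks for.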

## Quick Setup
You might need an AWS account with SageMaker and Bedrock properly enabled.
To set up the local Python environment, run the following command:
```
pip install -r requirements.txt
```
You can skip this step if you do not intend to run the scripts locally, but make sure the AWS CLI and its Python library are properly configured.

As a last step, create the `.env` file in the root project directory and copy the content from `.env.template`. Fill in your own WandB secret to see the logs from the training runs.

## Pipeline Overview

The figure below shows the order in which the scripts that make up the main pipeline are executed, and which components are required at each point. Note that running them requires a Gold Reward Model trained in advance (see the Gold Reward Model Pipeline).

![RLHF Pipeline](docs/RLHF%20Pipeline.png)

To run the above-mentioned pipeline from a single script, first install [yq](https://github.com/mikefarah/yq#install) (e.g., `brew install yq`).

Also, set up an AWS account to execute SageMaker training jobs. This is not necessary if you have a sufficiently powerful GPU, since all commands can be adjusted to run locally.

Then, to run the pipeline with the GPT-2 config and TL;DR dataset, execute the following script:
```
./run_all.sh -c gpt2 -d tldr
```

The `-c` flag specifies which configs to use ([from the configs folder](/configs)) and the `-d` flag specifies which dataset to use. Currently supported datasets are `tldr` and `alpaca`, but the codebase is easily extensible to other datasets (see the format of the implemented datasets in [this file](src/utils/dataset.py)).

## Gold Reward Model Pipeline
The figure below describes the process for obtaining the gold reward model. First, the SFT models of all sizes must be trained. Then, we generate answers from all SFT models for the specified dataset and create pairs from them. After that, we obtain the preferences by annotating the pairs with Claude through Amazon Bedrock. Finally, we train the Gold RM on the annotated preference dataset.

![Gold Reward Model Pipeline](docs/Gold%20Reward%20Model.png)
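
The pairing step described above could look roughly like the sketch below. It is illustrative only and is not taken from this codebase: the function `build_pairs`, the record fields, and the model names in the example are hypothetical, and the choice to skip identical answers is an assumption.

```python
from itertools import combinations


def build_pairs(prompt, completions_by_model):
    """Return every unordered pair of answers generated for one prompt.

    completions_by_model maps a (hypothetical) SFT model name to the
    answer that model generated for the prompt.
    """
    pairs = []
    # Sort by model name so the output order is deterministic.
    items = sorted(completions_by_model.items())
    for (model_a, text_a), (model_b, text_b) in combinations(items, 2):
        if text_a == text_b:
            # ASSUMPTION: identical answers carry no preference signal.
            continue
        pairs.append(
            {
                "prompt": prompt,
                "response_a": text_a,
                "response_b": text_b,
                "source_a": model_a,
                "source_b": model_b,
            }
        )
    return pairs


example = build_pairs(
    "Summarize: the cat sat on the mat all day.",
    {
        "sft-small": "A cat sat.",
        "sft-medium": "A cat sat on a mat.",
        "sft-large": "The cat stayed on the mat all day.",
    },
)
print(len(example))
```

Each resulting record would then be sent for annotation, and the annotated records would form the preference dataset used to train the Gold RM.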

## Experiment with the Pipeline
To test other hyperparameters, edit the configuration files under the [configs/<model name> directory](/configs/) and run the desired script. If you wish to launch multiple PPO runs at once using SageMaker, the script below executes four configurations: Uncertainty-Adaptive RLHF, Standard RLHF with ensembles, Standard RLHF with a single RM, and Pessimistic ensemble RLHF.
```
./run_ppos.sh -c gpt2 -d tldr
```
