# Pareto-Optimal Learning from Preferences

This repository contains the code for the paper "Pareto-Optimal Learning from Preferences with Hidden Context". It implements Pareto-Optimal Learning from Preferences (POPL) and Bayesian Reward Extrapolation (B-REx) for the paper's LLM experiments.

## Installation

1.  Install Python 3.8, 3.9, 3.10, or 3.11.
2.  Install pip requirements:

        pip install -r requirements.txt

## Data and pre-trained models

The experiments use the data and pre-trained models from the `data` directory of the following GitHub repository: https://github.com/cassidylaidlaw/hidden-context/tree/main/data. The commands below reference these files under a local `data/` directory.

## Running experiments

### Extracting training set features

To extract last-layer features from the training set, run

    python -m popl.rex_feature_extract --model_name=meta-llama/Llama-2-7b-hf --reward_model_checkpoint=data/reward_models/relabeled_hh_rlhf/both/base_Llama-2-7b-hf__0_3e-06_cosine_1_peft_last_checkpoint --data_path=data/relabeled_hh_rlhf

### Running LLM reward inference with POPL or B-REx

To run POPL on the extracted features, run

    python -m popl.rex_main --features_dir data/reward_models/relabeled_hh_rlhf/both/base_Llama-2-7b-hf__0_3e-06_cosine_1_peft_last_checkpoint/rex_features --method lexicase --mcmc_step_size 0.05

- To run B-REx, replace `lexicase` with `brex`.
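
The commands in this README share one checkpoint directory and differ only in the method name. The sketch below spells out that layout as inferred from the command lines here: features under `rex_features`, and one chain file per method at `<checkpoint>/<method>/mcmc_chain.txt`. The helper names `features_dir` and `mcmc_chain_file` are hypothetical and not part of the `popl` package.

```python
from pathlib import Path

# Checkpoint directory used by the commands in this README.
CHECKPOINT = Path(
    "data/reward_models/relabeled_hh_rlhf/both/"
    "base_Llama-2-7b-hf__0_3e-06_cosine_1_peft_last_checkpoint"
)

# Method names accepted by --method according to this README.
METHODS = ("lexicase", "brex")


def features_dir(checkpoint: Path) -> Path:
    """Directory passed to --features_dir (hypothetical helper)."""
    return checkpoint / "rex_features"


def mcmc_chain_file(checkpoint: Path, method: str) -> Path:
    """Chain file passed to --rex_mcmc_file (hypothetical helper).

    Assumes the layout <checkpoint>/<method>/mcmc_chain.txt, as inferred
    from the evaluation commands in this README.
    """
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}; expected one of {METHODS}")
    return checkpoint / method / "mcmc_chain.txt"


if __name__ == "__main__":
    print(features_dir(CHECKPOINT))
    for method in METHODS:
        print(mcmc_chain_file(CHECKPOINT, method))
```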

### Evaluating LLM reward models on HH-RLHF

To evaluate the LLM reward models generated by reward inference, run

    python -u -m popl.evaluate_hh --model_name=meta-llama/Llama-2-7b-hf --num_outputs=1 --reward_model_checkpoint=data/reward_models/relabeled_hh_rlhf/both/base_Llama-2-7b-hf__0_3e-06_cosine_1_peft_last_checkpoint --rex_mcmc_file=data/reward_models/relabeled_hh_rlhf/both/base_Llama-2-7b-hf__0_3e-06_cosine_1_peft_last_checkpoint/lexicase/mcmc_chain.txt --rex_normalize median

- Replace `lexicase` with `brex` if you are evaluating B-REx reward models.

- This script writes `eval_results_hh.jsonl` to the folder that contains the `mcmc_chain.txt` file. The file holds the raw reward model outputs for each response pair in the HH-RLHF test set.
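
This README does not document the fields of the JSONL output, so a reasonable first step is to inspect the file without assuming a schema. The sketch below parses one JSON object per line and counts how often each top-level key appears. The function names are hypothetical, and the path is only an example built from the command above.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Any, Dict, Iterable, List


def parse_jsonl(lines: Iterable[str]) -> List[Dict[str, Any]]:
    """Parse one JSON object per non-empty line (blank lines are skipped)."""
    records = []
    for line in lines:
        line = line.strip()
        if line:
            records.append(json.loads(line))
    return records


def field_counts(records: Iterable[Dict[str, Any]]) -> Counter:
    """Count how many records contain each top-level key."""
    counts: Counter = Counter()
    for record in records:
        counts.update(record.keys())
    return counts


if __name__ == "__main__":
    # Example location only: the folder of the chain file used above.
    path = Path(
        "data/reward_models/relabeled_hh_rlhf/both/"
        "base_Llama-2-7b-hf__0_3e-06_cosine_1_peft_last_checkpoint/"
        "lexicase/eval_results_hh.jsonl"
    )
    if path.exists():
        with path.open(encoding="utf-8") as f:
            records = parse_jsonl(f)
        print(f"{len(records)} records")
        for key, count in field_counts(records).items():
            print(f"{key}: present in {count} records")
```

Because `parse_jsonl` accepts any iterable of strings, it works on an open file handle as well as on a list of lines.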

### Evaluating LLM reward models on jailbreaks

To evaluate the LLM reward models on responses to the Jailbroken prompts, run

    python -u -m popl.evaluate_jailbreak --model_name=meta-llama/Llama-2-7b-hf --num_outputs=1 --reward_model_checkpoint=data/reward_models/relabeled_hh_rlhf/both/base_Llama-2-7b-hf__0_3e-06_cosine_1_peft_last_checkpoint --rex_mcmc_file=data/reward_models/relabeled_hh_rlhf/both/base_Llama-2-7b-hf__0_3e-06_cosine_1_peft_last_checkpoint/lexicase/mcmc_chain.txt --input=data/jailbroken_responses.jsonl --rex_normalize median

- Replace `lexicase` with `brex` if you are evaluating B-REx reward models.

- This script writes `eval_results_jail.jsonl` to the folder that contains the `mcmc_chain.txt` file. The file holds the raw reward model outputs for each jailbreak response pair.
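
Covering both methods and both benchmarks takes four evaluation runs. The sketch below only builds and prints the four command lines using the flags shown in this README; it does not run anything. It assumes the evaluation scripts are invoked as modules with `-m`, and `evaluation_command` is a hypothetical helper, not part of the `popl` package.

```python
import shlex
from typing import List, Optional

MODEL_NAME = "meta-llama/Llama-2-7b-hf"
CHECKPOINT = (
    "data/reward_models/relabeled_hh_rlhf/both/"
    "base_Llama-2-7b-hf__0_3e-06_cosine_1_peft_last_checkpoint"
)


def evaluation_command(
    module: str, method: str, input_path: Optional[str] = None
) -> List[str]:
    """Build the argument list for one evaluation run (hypothetical helper).

    Assumes module invocation via `python -u -m <module>` and the chain
    file layout <checkpoint>/<method>/mcmc_chain.txt.
    """
    cmd = [
        "python", "-u", "-m", module,
        f"--model_name={MODEL_NAME}",
        "--num_outputs=1",
        f"--reward_model_checkpoint={CHECKPOINT}",
        f"--rex_mcmc_file={CHECKPOINT}/{method}/mcmc_chain.txt",
    ]
    if input_path is not None:
        # Only the jailbreak evaluation takes an --input file in this README.
        cmd.append(f"--input={input_path}")
    cmd += ["--rex_normalize", "median"]
    return cmd


if __name__ == "__main__":
    for method in ("lexicase", "brex"):
        print(shlex.join(evaluation_command("popl.evaluate_hh", method)))
        print(shlex.join(evaluation_command(
            "popl.evaluate_jailbreak", method, "data/jailbroken_responses.jsonl"
        )))
```

Each printed line can be pasted into a shell, or the argument list can be passed to `subprocess.run` to execute the runs in sequence.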

### Analyzing evaluations

To reproduce the paper's results for POPL and B-REx with LLM reward models, run

    python -m popl.summarize_results

This loads the outputs of the evaluation scripts above and summarizes them into the reported results.
