# Code for Provably Efficient and Practical Self-play Style RLHF for Better LLM Alignment


## Contents

This repository contains the code for the paper *Provably Efficient and Practical Self-play Style RLHF for Better LLM Alignment*. It covers TANPO, SADPO, and an extended experiment for TADPO. The code is based on the [Alignment Handbook](https://github.com/huggingface/alignment-handbook.git).

## Installation instructions

To run the code in this project, first create a Python virtual environment, e.g. with Conda:

```shell
conda create -n handbook python=3.10 && conda activate handbook
```

Next, install PyTorch `v2.1.2`. The precise version is important for reproducibility! Since the installation is hardware-dependent, we
direct you to the [PyTorch Installation Page](https://pytorch.org/get-started/locally/).
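
Because the exact version matters, it can help to confirm what actually got installed before continuing. The snippet below is a minimal sketch, not part of this codebase: `strip_build_suffix` is a hypothetical helper that drops a local build tag such as `+cu121` from the version string, and the check assumes `python` resolves to the interpreter of the `handbook` environment.

```shell
# Hypothetical helper (not part of this repository): drop a local build
# suffix such as "+cu121" so only the release number is compared.
strip_build_suffix() {
  printf '%s\n' "${1%%+*}"
}

expected="2.1.2"

# Assumption: `python` is the interpreter of the active environment.
if installed="$(python -c 'import torch; print(torch.__version__)' 2>/dev/null)"; then
  installed="$(strip_build_suffix "$installed")"
  if [ "$installed" = "$expected" ]; then
    echo "PyTorch $installed matches the expected version."
  else
    echo "PyTorch $installed found, expected $expected."
  fi
else
  echo "PyTorch is not importable in this environment."
fi
```

If the versions differ, reinstalling PyTorch before installing the remaining dependencies avoids rebuilding packages against the wrong version later.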

You can then install the remaining package dependencies as follows:

```shell
git clone https://github.com/huggingface/alignment-handbook.git
cd ./alignment-handbook/
python -m pip install .
```

You will also need Flash Attention 2 installed, which can be done by running:

```shell
python -m pip install flash-attn==2.3.6 --no-build-isolation
```

> **Note**
> If your machine has less than 96GB of RAM and many CPU cores, reduce the `MAX_JOBS` environment variable, e.g. `MAX_JOBS=4 pip install flash-attn==2.3.6 --no-build-isolation`

Next, log into your Hugging Face account as follows:

```shell
huggingface-cli login
```

Finally, install Git LFS so that you can push models to the Hugging Face Hub:

```shell
sudo apt-get install git-lfs
```

## How to Run the Code

To run the TANPO experiment, run:
```shell
./run_tanpo.sh
```

To run the SADPO experiment, run:
```shell
./run_sadpo.sh
```

To run TANPO for a second epoch, run:
```shell
./run_tanpo_epoch_2.sh
```