# Enhancing Safety in Reinforcement Learning with Human Feedback via Rectified Policy Optimization

This codebase contains the implementation of RePO.

## File Description
Our codebase is built upon the framework available at https://github.com/PKU-Alignment/safe-rlhf, which has been instrumental to our work. We are sincerely grateful for this valuable resource.

Under the original framework, we mainly added the following files to implement RePO：
+ `./safe_rlhf/algorithms/repo/` : the implementation of RePO
+ `./scrip/repo.sh`: the training script for RePO
+ `./safe_rlhf/datasets/raw/safety_llama.py`: an additional prompt-only dataset used to evaluate the LMs, which refer to [1] 


## Installation

Setup a conda environment using
```shell
conda env create --file conda-recipe.yaml
```
This process is the same as https://github.com/PKU-Alignment/safe-rlhf, where further details can be found.

## Training 

Before training, you should check the path of models and datasets.

RePO algorithm can be run via the command:

```shell
bash ./scripts/repo.sh
```

[1] Bianchi, F., Suzgun, M., Attanasio, G., Röttger, P., Jurafsky, D., Hashimoto, T., and Zou, J. (2023).
Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions.
arXiv preprint arXiv:2309.07875.
