# RLHF-LLM

This is the repository for RLHF-LLM. The codebase currently implements `BasicTrainer` for SFT and DPO training, `RewardTrainer` for reward model training, and `PPOTrainer` for PPO training.

### Update 11-30-2024:
- We upgraded the PPO Trainer to v2. You no longer need to set `use_policy_with_head=true` to use it.
- PPO training requires a large amount of GPU memory, so you will probably need to use a LoRA model by setting `use_peft=true` in the command.
- If you are working with the OpenAI summarization dataset, you can use models from Hugging Face directly, in addition to our checkpoints. For the SFT model, set `model.name_or_path=cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr` and `model.tokenizer_name_or_path=EleutherAI/pythia-1b-deduped`. For the value model and reward model, set `reward_model=pythia1 reward_model.name_or_path=cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr`. Setting only the reward model suffices, because the value model loads the same checkpoint as the reward model.

### Update 11-19-2024:
- We now support the OpenAI summarization dataset. To use it, set `dataset=[oai_summarization_sft]` for SFT training or `dataset=[oai_summarization]` for the preference dataset.
- You no longer need to set `use_policy_with_head=true` when doing PPO training (for both SFT policy and PPO).

### Basic Trainer
To use Basic Trainer, the basic command is:
```
CUDA_VISIBLE_DEVICES=0 python -u code/main.py model=gpt2-large datasets=[shp] \
    loss=sft gradient_accumulation_steps=8 batch_size=64 \
    eval_batch_size=8 trainer=BasicTrainer sample_during_eval=true
```
To use multiple devices, set `trainer=FSDP.BasicTrainer`:
```
CUDA_VISIBLE_DEVICES=0,1 python -u code/main.py model=gpt2-large datasets=[shp] \
    loss=sft gradient_accumulation_steps=8 batch_size=64 \
    eval_batch_size=8 trainer=FSDP.BasicTrainer sample_during_eval=true
```
Note: to fine-tune the policy model without DPO, remember to set `loss=sft`. For DPO training, set `loss=dpo`.
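When launching many runs, the trainer name can be derived from the device list instead of being edited by hand. The following is a minimal sketch, not part of the repository: `build_basic_command` is a hypothetical helper that only assembles the overrides shown above and does not check them against the configs.

```python
import shlex


def build_basic_command(devices, loss="sft", model="gpt2-large", dataset="shp"):
    """Assemble the Basic Trainer command for the given GPU ids.

    Hypothetical helper: it mirrors the overrides from the README commands
    and does not validate them against the repository's config files.
    """
    if not devices:
        raise ValueError("at least one device id is required")
    # One device -> BasicTrainer; several devices -> FSDP.BasicTrainer.
    trainer = "FSDP.BasicTrainer" if len(devices) > 1 else "BasicTrainer"
    env = {"CUDA_VISIBLE_DEVICES": ",".join(str(d) for d in devices)}
    argv = [
        "python", "-u", "code/main.py",
        f"model={model}",
        f"datasets=[{dataset}]",
        f"loss={loss}",
        "gradient_accumulation_steps=8",
        "batch_size=64",
        "eval_batch_size=8",
        f"trainer={trainer}",
        "sample_during_eval=true",
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_basic_command([0, 1], loss="dpo")
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    # shlex.join quotes arguments such as datasets=[shp] for the shell.
    print(prefix, shlex.join(argv))
```

Building the argument list in code also avoids stray line-continuation characters, since no backslashes are involved.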

### Reward Trainer
To use Reward Trainer, the basic command is:
```
CUDA_VISIBLE_DEVICES=0 python -u code/main.py reward_model=gpt2-large datasets=[shp] \
    loss=reward_loss exp_name=shp_sft_gpt2 gradient_accumulation_steps=8 batch_size=32 \
    eval_batch_size=4 trainer=RewardTrainer sample_during_eval=false \
    model.fsdp_policy_mp=bfloat16
```
To use multiple devices, set `trainer=FSDP.RewardTrainer`:
```
CUDA_VISIBLE_DEVICES=0,1 python -u code/main.py reward_model=gpt2-large datasets=[shp] \
    loss=reward_loss exp_name=shp_sft_gpt2 gradient_accumulation_steps=8 batch_size=32 \
    eval_batch_size=4 trainer=FSDP.RewardTrainer sample_during_eval=false \
    model.fsdp_policy_mp=bfloat16
```
Note: to use a different reward model, change the `reward_model` argument rather than `model`. Also make sure to set `loss=reward_loss`.
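For orientation, reward models are commonly trained on preference pairs with a pairwise logistic (Bradley-Terry) loss, `-log(sigmoid(r_chosen - r_rejected))`. The sketch below computes that quantity in plain Python. Whether `loss=reward_loss` in this codebase uses exactly this form is an assumption, and `pairwise_reward_loss` is an illustrative name rather than a function from the repository.

```python
import math


def pairwise_reward_loss(chosen_rewards, rejected_rewards):
    """Mean of -log(sigmoid(r_chosen - r_rejected)) over a batch of pairs.

    Assumed formulation (standard pairwise preference loss); the
    repository's reward loss may differ in details.
    """
    if len(chosen_rewards) != len(rejected_rewards) or not chosen_rewards:
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for r_chosen, r_rejected in zip(chosen_rewards, rejected_rewards):
        margin = r_chosen - r_rejected
        # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(chosen_rewards)
```

The loss shrinks as the reward of the chosen response rises above that of the rejected one, and equals `log(2)` when the two rewards are equal.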

### PPO Trainer
To use the PPO Trainer, the basic command is:

```
CUDA_VISIBLE_DEVICES=0 python -u code/main.py \
    model=pythia1 \
    datasets=[oai_summary_sft] \
    loss=ppo \
    gradient_accumulation_steps=16 \
    batch_size=64 \
    eval_batch_size=1 \
    trainer=PPOTrainer \
    model.fsdp_policy_mp=bfloat16 \
    debug=false \
    loss.local_rollout_forward_batch_size=16 \
    reward_model=pythia1 \
    reward_model.name_or_path=cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
    model.name_or_path=cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
    model.tokenizer_name_or_path=EleutherAI/pythia-1b-deduped \
    use_peft=true
```

To use multiple devices, set `trainer=FSDP.PPOTrainer`:
```
# Placeholders for local checkpoint paths; they are not referenced by the command below.
policy_path="<change this>"
reward_model_path="<change this>"

CUDA_VISIBLE_DEVICES=0,1 python -u code/main.py \
    model=pythia1 \
    datasets=[oai_summary_sft] \
    loss=ppo \
    gradient_accumulation_steps=16 \
    batch_size=64 \
    eval_batch_size=1 \
    trainer=FSDP.PPOTrainer \
    model.fsdp_policy_mp=bfloat16 \
    debug=false \
    loss.local_rollout_forward_batch_size=16 \
    reward_model=pythia1 \
    reward_model.name_or_path=cleanrl/EleutherAI_pythia-1b-deduped__reward__tldr \
    model.name_or_path=cleanrl/EleutherAI_pythia-1b-deduped__sft__tldr \
    model.tokenizer_name_or_path=EleutherAI/pythia-1b-deduped \
    use_peft=true
```

### Saved Models and Metrics

By default, all checkpoints and models are saved to `$local_dirs/$user_name/$exp_name/`. You can change `$local_dirs` in the `config.yaml` file.


### Add Your Models / Reward models / Losses
The codebase currently supports the DPO, SFT, and PPO losses. To add another loss, add a YAML file such as `config/loss/your_loss.yaml`.

The codebase supports the BERT, GPT2, GPT2-Large, GPT2-XL, GPTJ, llama7b, pythia28, and pythia69 models. To train another model, add a YAML file such as `config/model/your_model.yaml`.

## Running Evaluation

Bash scripts for running evaluation can be found in `eval_scripts/`.


### Generating Training Datasets

Generate the SFT and reward model (RM) datasets using:
```
bash eval_scripts/sft_rm_ds.sh
```

Refer to `eval/process_sft.py` for the full list of arguments. 


### Generating Responses From Fine-Tuned Model

`eval/generate_response.py` is the main script for generating responses for evaluation. The output is stored as a Hugging Face dataset: the original dataset with a new `model_response` column added. Refer to the following table of arguments:

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `model_path` | `str` | `""` | Hugging Face path of the model used to generate responses for evaluation |
| `model_name` | `str` | `""` | Name of the model used to generate responses for evaluation |
| `device` | `str` | `'cuda'` | Model device |
| `use_auto` | `bool` | `True` | Allocate model parameters across available devices (GPU, CPU) |
| `ds_type` | `str` | `"summary"` | Dataset type for evaluation (summarization, SHP, or other) |
| `split` | `str` | `'train'` | Evaluation dataset split |
| `num_examples` | `int` | `-1` | Number of examples for evaluation |
| `ds_seed` | `int` | `42` | Seed used to shuffle the dataset, for reproducibility |
| `hf_org` | `str` | `""` | Hugging Face organization in which to store the dataset with saved responses |
| `cache_dir` | `str` | `""` | Cache directory for Hugging Face datasets and models |
| `prompt_template` | `str` | `'summary_0_shot'` | Prompt template to be used for prompting |
| `use_cache` | `bool` | `True` | Use cache for generation |
| `do_sample` | `bool` | `False` | Whether to sample during generation; `False` means greedy decoding |
| `temperature` | `float` | `0.2` | Temperature during generation |
| `max_new_tokens` | `int` | `400` | Maximum number of new tokens to generate |

An example script can be found at  `eval_scripts/generate_response.sh`. 
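The decoding arguments interact: with `do_sample=False` generation is greedy, so `temperature` has no effect. Below is a minimal sketch of how these arguments could be turned into keyword arguments for a Hugging Face `generate` call. `build_generation_kwargs` is a hypothetical helper, not a function from `eval/generate_response.py`, and the script may handle these flags differently.

```python
def build_generation_kwargs(do_sample=False, temperature=0.2,
                            max_new_tokens=400, use_cache=True):
    """Map the table's decoding arguments to generate() keyword arguments.

    Hypothetical helper: the defaults mirror the table above.
    """
    kwargs = {
        "do_sample": do_sample,
        "max_new_tokens": max_new_tokens,
        "use_cache": use_cache,
    }
    if do_sample:
        if temperature <= 0:
            raise ValueError("temperature must be positive when sampling")
        # Temperature only matters when sampling, so pass it only then.
        kwargs["temperature"] = temperature
    return kwargs
```

With the defaults, the returned dictionary describes greedy decoding and carries no `temperature` entry.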

### Benchmarking Model with Generated Responses 

`eval/benchmark_models.py` is the main script for benchmarking. It compares either one model's responses against the reference responses from the dataset, or two models' responses against each other, and computes the win rate. Refer to the following table of arguments:

| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `model_path` | `str` | `"Qwen/Qwen2.5-72B-Instruct"` | Path of the label model used for evaluation |
| `model_name` | `str` | `"qwen2.5-72b"` | Name of the label model used for evaluation |
| `response1_path` | `str` | `{dataset_path}` | Hugging Face Hub path of the dataset containing response 1 |
| `response2_path` | `str` | `{dataset_path}` | Hugging Face Hub path of the dataset containing response 2 |
| `hf_org` | `str` | | Hugging Face organization in which to store the dataset with saved responses |
| `cache_dir` | `str` | | Cache directory for Hugging Face datasets and models |
| `ds_type` | `str` | `"summary"` | Dataset type (summarization, SHP, or other) |
| `split` | `str` | `"train"` | Dataset split for evaluation |
| `num_examples` | `int` | `-1` | Number of examples for evaluation |
| `seed` | `int` | `42` | Seed used to shuffle the dataset, for reproducibility |
| `prompt_template` | `str` | `"detailed_1_shot_preamble"` | Prompt template used for prompting the label model |
| `zero_id` | `str` | `"1"` | Label for response 1 |
| `one_id` | `str` | `"2"` | Label for response 2 |
| `reprompt` | `bool` | `False` | Have the label LLM generate a full response, then run a forward pass to extract the label |
| `device` | `str` | `"cuda"` | Label model device |
| `use_auto` | `bool` | `True` | Allocate model parameters across available devices (GPU, CPU) |
| `use_cache` | `bool` | `True` | Use cache for generation |
| `do_sample` | `bool` | `False` | Whether to sample during generation; `False` means greedy decoding |
| `temperature` | `float` | `0.2` | Temperature during generation |
| `max_new_tokens` | `int` | `400` | Maximum number of new tokens to generate |

An example script for computing the win rate of one model against the reference responses can be found at `eval_scripts/benchmark_model.sh`. An example script for comparing two different models can be found at `eval_scripts/benchmark_2models.sh`. The results are stored in `eval_results/` as a JSON file in the following format:

```
{
  "model1_win_rate": <number of model 1 responses preferred by the label LLM over those from model 2> / <total number of examples>,
  "model2_win_rate": <number of model 2 responses preferred by the label LLM over those from model 1> / <total number of examples>,
  "resp1_pref": <indices of examples where the model 1 response is preferred>,
  "resp2_pref": <indices of examples where the model 2 response is preferred>,
  "resp1_path": <identifier for model 1: HF dataset path containing the responses generated by model 1>,
  "resp2_path": <identifier for model 2: HF dataset path containing the responses generated by model 2>
}






