# Regressing the Relative Future: Efficient Policy Optimization for Multi-turn RLHF

## Environment

```
torch>=2.1.0
transformers>=4.34
accelerate>=0.23
peft==0.6.2
bitsandbytes>=0.41.1
deepspeed>=0.10.3
tyro
scipy
rouge
shortuuid
jsonlines
rich
wandb
tensorboard
pandas
evaluate
```

## Setting One

In this setting, at each iteration, we first generate the dialogues for the entire dataset (Huggingface Dataset Card: openbmb/UltraInteract_pair) using our policy as the assistant and Llama-3.1-70B-it as the user.
```
python generate.py
```

Then, we generate the rewards for all the dialogues using the ArmoRM (Huggingface Model Card: RLHFlow/ArmoRM-Llama3-8B-v0.1) as the reward model.
```
python rank.py
```

After that, the dataset goes through a filtering process. We filter out dialogues that are longer than 2048 tokens, that have the same set of responses, or that do not produce a valid reward score. We tokenize each dialogue and generate a mask for it.
```
python tokenize_masks.py
```
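
The three filtering rules above can be sketched as a single predicate. This is only an illustration, not the code in `tokenize_masks.py`: the field names `dialogues` and `rewards`, the helper `keep_example`, and the whitespace tokenizer are all hypothetical stand-ins, and "same set of responses" is read here as "all sampled dialogues for a prompt are identical".

```python
import math

MAX_TOKENS = 2048  # length cutoff stated in the filtering step


def count_tokens(text):
    # Stand-in tokenizer (whitespace split); an assumption for this sketch.
    # A real pipeline would count tokens with the policy model's tokenizer.
    return len(text.split())


def keep_example(example, max_tokens=MAX_TOKENS):
    """Return True if a (hypothetical) example passes all three filters."""
    dialogues = example["dialogues"]  # hypothetical field: sampled dialogues
    rewards = example["rewards"]      # hypothetical field: one score per dialogue

    # 1) Drop the example if any dialogue exceeds the token limit.
    if any(count_tokens(d) > max_tokens for d in dialogues):
        return False

    # 2) Drop the example if all dialogues are identical.
    if len(set(dialogues)) < 2:
        return False

    # 3) Drop the example if any reward is missing or not a finite number.
    for r in rewards:
        if not isinstance(r, (int, float)) or not math.isfinite(r):
            return False

    return True
```

A predicate of this shape could be passed to a dataset's filter step so that only examples returning `True` are tokenized and masked.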

Finally, we train the Llama-3-8B-it by running:

```
accelerate launch \
    --config_file accelerate_cfgs/ds_config2.yaml \
    --num_processes 8 \
    ./refuel.py \
        --task.total_length 2048 \
        --task.temperature 0.8 \
        --lr 3e-7 \
        --rebel.eta 1e3 \
        --warmup_ratio 0.1 \
        --total_episodes 64000 \
        --per_device_train_batch_size 1 \
        --gradient_accumulation_steps 16 \
        --per_device_eval_batch_size 1 \
        --print_sample_output_freq 100
```
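
As a sanity check on the flags above, the effective batch size can be worked out by hand. The sketch below assumes the usual data-parallel convention (processes × per-device batch × accumulation steps) and, as a further assumption, that one episode corresponds to one training example; the helper name is hypothetical and `refuel.py` may count episodes differently.

```python
def effective_batch_size(num_processes, per_device_batch, grad_accum_steps):
    # Examples consumed per optimizer step under standard data parallelism.
    return num_processes * per_device_batch * grad_accum_steps


# Values taken from the Setting One launch command.
batch = effective_batch_size(num_processes=8, per_device_batch=1, grad_accum_steps=16)

# Assumption: one episode is one training example, so this is the step count.
optimizer_steps = 64000 // batch
```

Under these assumptions, `batch` is 128 and `optimizer_steps` is 500.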

## Setting Two

#### Anthropic HH

First, we process the dataset (Huggingface Dataset Card: trl-internal-testing/hh-rlhf-trl-style) by filtering out dialogues with more than 5 turns, prompts longer than 128 tokens, or responses longer than 512 tokens.
```
python preprocess_hh.py
```
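
The filter described above can be sketched as follows. This is a minimal illustration rather than the contents of `preprocess_hh.py`: it assumes each dialogue is a list of `{"role", "content"}` messages, that a turn is one assistant reply, that the length limits apply to each message separately, and it uses a whitespace split in place of the model tokenizer. The helper names are hypothetical.

```python
MAX_TURNS = 5              # dialogues with more turns are dropped
MAX_PROMPT_TOKENS = 128    # limit for user messages
MAX_RESPONSE_TOKENS = 512  # limit for assistant messages


def count_tokens(text):
    # Stand-in tokenizer (whitespace split); an assumption for this sketch.
    return len(text.split())


def keep_dialogue(messages):
    """Return True if the dialogue satisfies the turn and length limits."""
    # Count one turn per assistant reply (assumed definition of a turn).
    turns = sum(1 for m in messages if m["role"] == "assistant")
    if turns > MAX_TURNS:
        return False

    # Check every message against the limit for its role.
    for m in messages:
        limit = MAX_PROMPT_TOKENS if m["role"] == "user" else MAX_RESPONSE_TOKENS
        if count_tokens(m["content"]) > limit:
            return False

    return True
```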

Then, we train the Llama-3-8B-it with reward model FsfairX (Huggingface Model Card: sfairXC/FsfairX-LLaMA3-RM-v0.1) by running:
```
accelerate launch --config_file accelerate_cfgs/deepspeed_config.yaml --main_process_port 29073 --num_processes 8 ./refuel.py \
	--base_model meta-llama/Meta-Llama-3-8B-Instruct \
	--per_device_train_batch_size 1 \
	--gradient_accumulation_steps 4 \
	--per_device_eval_batch_size 1 \
	--lr 3e-7 \
	--eps 1e-8 \
	--weight_decay 1e-6 \
	--reward.kl_coef 0.05 \
	--rebel.eta 1.0 \
	--task.penalty_reward_value -10 \
	--print_sample_output_freq 200 \
	--task.response_length 512 \
	--offload
```

#### UltraInteract

First, we process the dataset (Huggingface Dataset Card: openbmb/UltraInteract_pair) by filtering out dialogues with more than 5 turns, as well as prompts and responses that exceed the length limits in Table 5 of the paper.
```
python preprocess_ultrainteract_diff_len.py
```

Then, we train the Llama-3-8B-it with reward model FsfairX (Huggingface Model Card: sfairXC/FsfairX-LLaMA3-RM-v0.1) by running:

```
accelerate launch --config_file accelerate_cfgs/deepspeed_config.yaml --main_process_port 29073 --num_processes 8 ./refuel.py \
    --base_model meta-llama/Meta-Llama-3-8B-Instruct \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 4 \
    --per_device_eval_batch_size 1 \
    --wandb_project_name multiturn \
    --lr 3e-7 \
    --eps 1e-8 \
    --weight_decay 1e-6 \
    --reward.kl_coef 0 \
    --rebel.eta 1.0 \
    --task.penalty_reward_value -4 \
    --print_sample_output_freq 200 \
    --offload
```