Our implementation builds on `OpenRLHF`, as cited in the paper. We primarily modify the following components:

- `/openrlhf/cli/train_ppo_ray.py`
- `/openrlhf/trainer/ppo_utils/experience_maker.py`
- `/openrlhf/utils/remote_rm_utils.py`
- `/openrlhf/datasets/prompts_dataset.py`

These changes support our new reward modeling approach, which incorporates reference trajectories (responses) into the PPO training process. Specifically:

- We add new training arguments in `/openrlhf/cli/train_ppo_ray.py`.
- We adapt the experience data processing logic in `/openrlhf/trainer/ppo_utils/experience_maker.py` and `/openrlhf/datasets/prompts_dataset.py` to handle the new reward model and training data format.
- We enable support for remote reward models served via `SGLang` in `/openrlhf/utils/remote_rm_utils.py`.
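
As a rough illustration of the first point, the sketch below shows how two of the arguments named in this README (`reward_remote_url` and `prompt_data_path`) could be declared with `argparse`. The flag spellings, defaults, and help strings are assumptions made for illustration; the authoritative definitions are in `/openrlhf/cli/train_ppo_ray.py` and may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative parser only; not the parser used by train_ppo_ray.py."""
    parser = argparse.ArgumentParser(description="PPO training arguments (sketch)")
    # Address of the remotely served reward model (flag spelling is an assumption).
    parser.add_argument("--reward_remote_url", type=str, default=None,
                        help="Address of the reward model service")
    # Path to the JSONL training prompts (flag spelling is an assumption).
    parser.add_argument("--prompt_data_path", type=str, default=None,
                        help="Path to the training dataset")
    return parser


# Parse an explicit argument list so the sketch runs without a command line.
args = build_parser().parse_args([
    "--reward_remote_url", "http://127.0.0.1:30000",
    "--prompt_data_path", "sample_data/head_100_ppo.jsonl",
])
print(args.reward_remote_url, args.prompt_data_path)
```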

For a quick start, see the example script at `/cmds/ray_ppo_example.sh`, which shows how to run PPO experiments with our modifications. The important arguments are explained within the script.

Before starting PPO training, we recommend serving the reward model over an API using a framework such as `SGLang`. For example:  
`python3 -m sglang.launch_server --model-path Skywork/Skywork-Reward-Llama-3.1-8B-v0.2 --host 0.0.0.0 --is-embedding`  
Pass the address (URL) of the running service to the training script via the `reward_remote_url` argument.
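
Before launching training, it can help to check that the reward server responds. The following is a minimal sketch of such a check using only the Python standard library. It rests on several assumptions that are not stated in this README: that the server listens on port 30000, that it exposes a `/classify` endpoint accepting `model` and `text` fields, and that each returned item carries the score as the first element of an `embedding` field. The helper names are hypothetical; adjust them and the assumptions to match your SGLang version and the logic in `/openrlhf/utils/remote_rm_utils.py`.

```python
import json
from urllib import request


def build_classify_request(base_url, texts, model):
    """Build a POST request; the '/classify' path and payload keys are assumptions."""
    payload = json.dumps({"model": model, "text": texts}).encode("utf-8")
    return request.Request(
        base_url.rstrip("/") + "/classify",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_rewards(response_body):
    """Extract one score per input, assuming a list of {"embedding": [score]} items."""
    return [float(item["embedding"][0]) for item in json.loads(response_body)]


def query_rewards(base_url, texts, model, timeout=30.0):
    """Send the texts to the reward server and return the parsed scores."""
    req = build_classify_request(base_url, texts, model)
    with request.urlopen(req, timeout=timeout) as resp:
        return parse_rewards(resp.read().decode("utf-8"))


# Example call (requires a running server; host and port are assumptions):
# query_rewards("http://127.0.0.1:30000",
#               ["<formatted prompt and response>"],
#               "Skywork/Skywork-Reward-Llama-3.1-8B-v0.2")
```

Request construction and response parsing are kept separate from the network call so that each piece can be checked without a live server.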

We also provide a sample dataset of 100 training samples at `/sample_data/head_100_ppo.jsonl`. Specify the path to the training dataset with the `prompt_data_path` argument in the script.
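
To see which fields the training data format expects, the sample file can be inspected before training. The sketch below counts the records in a JSONL file and tallies their top-level keys. It assumes only that each non-empty line is a JSON value; it makes no assumption about the field names, and `summarize_jsonl` is a hypothetical helper that is not part of the repository.

```python
import json
from collections import Counter


def summarize_jsonl(path, max_records=None):
    """Return (number of records read, Counter of top-level keys seen)."""
    key_counts = Counter()
    n_records = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            n_records += 1
            if isinstance(record, dict):
                key_counts.update(record.keys())
            if max_records is not None and n_records >= max_records:
                break
    return n_records, key_counts


# Example call, run from the repository root:
# n, keys = summarize_jsonl("sample_data/head_100_ppo.jsonl")
# print(n, dict(keys))
```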
