# Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

## 🔧 Quick Start

We implement our algorithm on two frameworks, OpenRLHF and verl, each in its own branch. If you are new to the project, we recommend the verl version.

### Installation

#### 1. OpenRLHF version

Please follow [OpenRLHF's installation guidance](https://github.com/OpenRLHF/OpenRLHF/tree/main?tab=readme-ov-file#installation) to set up the required environment, then run `pip install -r requirements.txt`.

#### 2. verl version

Please refer to the [official installation guidance](https://verl.readthedocs.io/en/latest/start/install.html#install-from-custom-environment) of verl.

### Training of PRM

We train the PRM in 2 stages using [TRL](https://github.com/huggingface/trl) and a [preprocessed PRM800K dataset](https://huggingface.co/datasets/HuggingFaceH4/prm800k-trl-dedup). In the first stage, we freeze the LLM and train only the last score layer (MLP) with a learning rate of 1e-4 for 3 epochs. In the second stage, we unfreeze the LLM and fine-tune all parameters with a learning rate of 1e-6 for 1 epoch.

```bash
cd PRM
# stage 1
bash train_stage_1.sh
# stage 2
bash train_stage_2.sh
```
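The two-stage schedule described above can be summarized in a small, framework-free sketch. It is illustrative only and is not taken from the training scripts: the `score.*` pattern and the example parameter names are assumptions about how the score head might be named, while the learning rates and epoch counts come from the description above.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Stage:
    trainable: tuple      # glob patterns of parameter names that are trained
    learning_rate: float
    epochs: int


# Learning rates and epochs follow the text; "score.*" is an assumed head name.
STAGES = {
    1: Stage(trainable=("score.*",), learning_rate=1e-4, epochs=3),
    2: Stage(trainable=("*",), learning_rate=1e-6, epochs=1),
}


def trainable_params(param_names, stage):
    """Return the parameter names that receive gradients in the given stage."""
    patterns = STAGES[stage].trainable
    return [n for n in param_names if any(fnmatch(n, p) for p in patterns)]


# Hypothetical parameter names, for illustration only.
names = [
    "model.layers.0.mlp.weight",
    "model.norm.weight",
    "score.weight",
    "score.bias",
]
print(trainable_params(names, 1))  # only the score head
print(trainable_params(names, 2))  # every parameter
```

In a PyTorch model, the same selection would typically be applied by setting `requires_grad` on each parameter before building the optimizer for that stage.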

### Training of LLM

#### 1. OpenRLHF version

Switch to the [openrlhf branch](https://github.com/CJReinforce/PURE/tree/openrlhf) and run the following command. The parameter `reward_mode` in the [script](https://github.com/CJReinforce/PURE/blob/openrlhf/examples/scripts/train_pure.sh) controls the reward type and can be set to `PRM`, `VR`, or `PRMVR`.

```bash
bash examples/scripts/train_pure.sh
```

The script uses Ray+vLLM for rollout acceleration, with the first 4 GPUs allocated for the actor, initial actor (reference model), and PRM. The remaining GPUs are used for the vLLM engines. This setup works with 5 to 8 GPUs; just adjust the number of vLLM engines in the script accordingly.
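As a rough aid for choosing the engine count, the arithmetic can be sketched as follows. The function name and the default of one GPU per vLLM engine are assumptions made for illustration; check the script for the actual tensor-parallel setting before relying on the result.

```python
RESERVED_GPUS = 4  # actor, reference model, and PRM, per the text above


def num_vllm_engines(total_gpus, gpus_per_engine=1):
    """Number of vLLM engines that fit on the GPUs left after the reserved ones.

    gpus_per_engine=1 is an assumption, not a value taken from the script.
    """
    if not 5 <= total_gpus <= 8:
        raise ValueError("this setup is described for 5 to 8 GPUs")
    remaining = total_gpus - RESERVED_GPUS
    if remaining % gpus_per_engine != 0:
        raise ValueError("remaining GPUs do not divide evenly among engines")
    return remaining // gpus_per_engine


for total in (5, 6, 8):
    print(total, "GPUs ->", num_vllm_engines(total), "vLLM engine(s)")
```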

#### 2. verl version

Switch to the [verl branch](https://github.com/CJReinforce/PURE/tree/verl). Set the reward type in the [config file](verl/trainer/config/ppo_trainer.yaml):

1. `PURE-VR` uses `reward_model.enable=False reward_model.reward_manager=prime`
2. `PURE-PRM` uses `reward_model.enable=True reward_model.reward_manager=blank`
3. `PURE-PRM+VR` uses `reward_model.enable=True reward_model.reward_manager=prime`

Then start training:

```bash
python -m verl.trainer.main_ppo
```
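If you prefer not to edit the config file, the three reward types can also be expressed as `key=value` overrides appended to the training command. The helper below is a minimal sketch under the assumption that the entry point accepts Hydra-style command-line overrides; `REWARD_MODES` and `build_command` are hypothetical names, and the settings themselves are copied from the list above.

```python
import shlex

# Settings for each reward type, copied from the list above.
REWARD_MODES = {
    "PURE-VR": {"reward_model.enable": False, "reward_model.reward_manager": "prime"},
    "PURE-PRM": {"reward_model.enable": True, "reward_model.reward_manager": "blank"},
    "PURE-PRM+VR": {"reward_model.enable": True, "reward_model.reward_manager": "prime"},
}


def build_command(mode):
    """Build the training command with key=value overrides for one reward type."""
    overrides = [f"{key}={value}" for key, value in REWARD_MODES[mode].items()]
    return ["python", "-m", "verl.trainer.main_ppo", *overrides]


print(shlex.join(build_command("PURE-PRM+VR")))
# python -m verl.trainer.main_ppo reward_model.enable=True reward_model.reward_manager=prime
```

The returned list can be passed directly to `subprocess.run` if you want to launch training from Python.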

The hybrid engine of verl allows for higher GPU utilization than the OpenRLHF version.

### Evaluation

We use [Qwen Math's codebase](https://github.com/QwenLM/Qwen2.5-Math/tree/main/evaluation) for evaluation (i.e., pass@1 accuracy). For fairness, we completely prohibit solving problems by calling code, following SimpleRL. Please follow the instructions in `/eval` to run the evaluation.

## 🌻 Acknowledgement

We implement our RL algorithm based on [OpenRLHF](https://github.com/OpenRLHF/OpenRLHF) and [verl](https://github.com/volcengine/verl). We thank the developers of OpenRLHF and the author of SimpleRL for helpful discussions! We also refer to the code and hyperparameter values of [TRL](https://github.com/huggingface/trl) and [PRIME](https://github.com/PRIME-RL/PRIME) to varying degrees, and we thank them for their wonderful work!
