# Algorithm Baselines

Last updated: 06/18/2025.

## Math related datasets

### GSM8k

This section assumes the GSM8k/MATH datasets have been preprocessed via:

```bash
python3 examples/data_preprocess/*.py
```

Refer to the table below to reproduce RL training from different pre-trained checkpoints. Unless otherwise specified, scores are on the GSM8k dataset. More comprehensive benchmark results are available in the recipe folder.


| Hardware    | Model                            | Method            | Test score   | Details |
|-------------|----------------------------------|-------------------|--------------|---------|
| NVIDIA GPU  | google/gemma-2-2b-it             | hf checkpoint     | 23.9         | [Huggingface](XXXX) |
| NVIDIA GPU  | google/gemma-2-2b-it             | SFT               | 52.06        | [command and logs](XXXX) |
| NVIDIA GPU  | google/gemma-2-2b-it             | SFT + PPO         | 64.02        | [command and logs](XXXX), [wandb](XXXX) |
| NVIDIA GPU  | Qwen/Qwen2.5-0.5B-Instruct       | hf checkpoint     | 36.4         | [Qwen blog](XXXX) |
| NVIDIA GPU  | Qwen/Qwen2.5-0.5B-Instruct       | PPO               | 56.7         | [command and log](XXXX) |
| NVIDIA GPU  | Qwen/Qwen2.5-0.5B-Instruct       | PRIME             | 58.7         | [script](XXXX), [wandb](XXXX) |
| NVIDIA GPU  | Qwen/Qwen2.5-0.5B-Instruct       | GRPO-LoRA         | 54.3         | [command and logs](XXXX)|
| NVIDIA GPU  | Qwen/Qwen2.5-1.5B-Instruct       | GRPO-LoRA         | 77.9         | [command and logs](XXXX)|
| NVIDIA GPU  | Qwen/Qwen2.5-3B-Instruct         | GRPO-LoRA         | 86.1         | [command and logs](XXXX)|
| NVIDIA GPU  | deepseek-ai/deepseek-llm-7b-chat | PPO (Megatron)    | 69.5 [1]     | [log](XXXX), [wandb](XXXX) |
| NVIDIA GPU  | Qwen/Qwen2-7B-Instruct           | GRPO              | 89           | [script](XXXX) |
| NVIDIA GPU  | Qwen/Qwen2-7B-Instruct           | GRPO (FSDP2)      | 89.8         | [log](XXXX) |
| NVIDIA GPU  | Qwen/Qwen2-7B-Instruct           | GRPO (Megatron)   | 89.6         | [log](XXXX) |
| NVIDIA GPU  | Qwen/Qwen2.5-7B-Instruct         | ReMax             | 97           | [script](XXXX), [wandb](XXXX) |
| NVIDIA GPU  | Qwen/Qwen2.5-7B-Instruct         | SPPO              | 65.6 (MATH)  | [SPPO script](XXXX) |
| NVIDIA GPU  | Qwen/Qwen2.5-7B-Instruct         | GRPO-LoRA         | 93.4         | [command and logs](XXXX)|
| NVIDIA GPU  | Mixtral-8x22B-Instruct-v0.1      | Instruct model    | 83.7         | [Qwen Blog](XXXX) |
| NVIDIA GPU  | Mixtral-8x22B-Instruct-v0.1      | RLOO (Megatron)   | 92.3         | [wandb](XXXX) |
| NVIDIA GPU  | Qwen/Qwen2.5-7B-Instruct         | SPIN              | 92           | [script](XXXX) |
| NVIDIA GPU  | Qwen/Qwen2-7B-Instruct           | GPG               | 88           | [log](XXXX), [wandb](XXXX) |
| NVIDIA GPU  | Qwen/Qwen2-7B-Instruct           | GPG (Megatron)    | 88           | [log](XXXX), [wandb](XXXX) |
| NVIDIA GPU  | Qwen/Qwen2.5-VL-7B-Instruct      | GRPO (Megatron)   | 65.4 (GEO3k) | [script](XXXX), [wandb](XXXX) |
| AMD MI300   | deepseek-ai/deepseek-llm-7b-chat | PPO               | 70.5 [1]     | [log](XXXX) |
| AMD MI300   | deepseek-ai/deepseek-llm-7b-chat | GRPO              | 71.4 [1]     | [log](XXXX) |
| NVIDIA GPU  | Qwen/Qwen2.5-14B-Instruct        | GRPO-LoRA         | 94.6         | [command and logs](XXXX)|
| NVIDIA GPU  | Qwen/Qwen2.5-32B-Instruct        | GRPO-LoRA         | 95.8         | [command and logs](XXXX)|
| NVIDIA GPU  | Qwen/Qwen2.5-72B-Instruct        | GRPO-LoRA         | 96.0         | [command and logs](XXXX)|

### DAPO math-17k

- Training DAPO math-17k dataset: XXXX
- Testing: AIME'24: XXXX

Note:
- For Qwen/Qwen2.5-Math-7B, we directly set `max_position_embeddings` to 32768 so that training can use longer responses. We did not observe any performance degradation from this change.
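
The note above does not say how the value was changed. One plausible way, sketched here under the assumption that you have a local copy of the checkpoint with a Hugging Face-style `config.json`, is to patch the field in place. The helper name `set_max_position_embeddings` and the path in the usage comment are hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def set_max_position_embeddings(config_path, new_value=32768):
    """Overwrite max_position_embeddings in a config.json file.

    Hypothetical helper for illustration. Returns the previous value
    (or None if the field was absent) so the change can be reverted.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    old_value = config.get("max_position_embeddings")
    config["max_position_embeddings"] = new_value
    # All other fields are written back unchanged.
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return old_value


# Usage (hypothetical local path to a downloaded checkpoint):
#   set_max_position_embeddings("/path/to/Qwen2.5-Math-7B/config.json")
```

Keeping the returned previous value makes it easy to restore the original setting when comparing against the unmodified checkpoint.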

| Hardware    | Model                       | Method                  | Test score | Details |
|-------------|-----------------------------|-------------------------|------------|---------|
| NVIDIA GPU  | Qwen/Qwen2.5-Math-7B (32k)  | DAPO                    | 36.3       | [command](XXXX), [logs](XXXX)|
| NVIDIA GPU  | Qwen/Qwen2.5-7B-Instruct    | DAPO + Code Interpreter | 40.0       | [command](XXXX)|




## Coding related datasets

Unless otherwise specified, the results below are on LeetCode.

| Hardware    | Model                            | Method            | Test score   | Details |
|-------------|----------------------------------|-------------------|--------------|---------|
| NVIDIA GPU  | PRIME-RL/Eurus-2-7B-SFT          | PRIME             | 36.1         | [script](XXXX), [swanlab](XXXX) |


### Notes

[1] During evaluation, we only extract answers that follow the `"####"` format. More flexible answer extraction, a longer response length, and better prompt engineering may lead to a higher score.

[2] Since verl 0.3.x (2025-05-30), the default value of `actor_rollout_ref.actor.entropy_coeff` is `0.0`, which differs from previous versions.
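
To illustrate why the strict extraction in note [1] can understate a model's score, here is a minimal sketch of a `"####"`-only extractor. It is not the evaluation code used for the tables above. The regular expression, the choice to take the last match, and the function names `extract_strict_answer` and `strict_score` are all assumptions made for this example.

```python
import re

# Assumed pattern: "####", optional whitespace, then an optionally signed
# number that may contain thousands separators and a decimal part.
_STRICT_ANSWER = re.compile(r"####\s*(-?\d[\d,]*(?:\.\d+)?)")


def extract_strict_answer(response):
    """Return the number after the last '####' marker, or None if absent."""
    matches = _STRICT_ANSWER.findall(response)
    if not matches:
        return None
    # Drop thousands separators so "1,234" compares equal to "1234".
    return matches[-1].replace(",", "")


def strict_score(response, ground_truth):
    """Score 1.0 only when the strictly extracted answer matches exactly."""
    return 1.0 if extract_strict_answer(response) == ground_truth else 0.0


print(strict_score("48 + 24 = 72 clips. #### 72", "72"))   # counted as correct
print(strict_score("So the final answer is 72.", "72"))     # correct value, wrong format
```

Under this sketch the second response scores zero even though it contains the right number, which is the kind of case a more flexible extractor would recover.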
