

# 🌻Acknowledgement
Our reinforcement learning implementation extends [veRL](https://github.com/volcengine/verl). We use [vLLM](https://github.com/vllm-project/vllm) for inference.

## Installation

You can install the dependencies by running the following command:

```
conda env create -n lp-reg -f environment.yaml
```

## Training

Before training, make sure the AIME and AIME25 datasets have their "data_source" field set to "aime" and "aime25", respectively. These values are hardcoded so that rollouts for these two datasets use a temperature of 0.6.
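As a quick sanity check before launching a run, the labels can be verified with a few lines of Python. This is a minimal sketch, not part of the Lp-Reg codebase: the helper names are hypothetical, and it assumes the dataset rows have already been loaded as a list of dictionaries that each carry a "data_source" key.

```python
# Hypothetical helpers for checking "data_source" labels.
# Assumption: rows are already loaded as a list of dicts.

EXPECTED_SOURCES = {"aime", "aime25"}


def count_data_sources(records):
    """Count how many records carry each "data_source" value."""
    counts = {}
    for record in records:
        source = record.get("data_source")  # None if the field is missing
        counts[source] = counts.get(source, 0) + 1
    return counts


def missing_sources(records, expected=EXPECTED_SOURCES):
    """Return the expected labels that never appear in the records."""
    return sorted(set(expected) - set(count_data_sources(records)))


if __name__ == "__main__":
    sample = [
        {"data_source": "aime", "prompt": "..."},
        {"data_source": "AIME25", "prompt": "..."},  # wrong casing on purpose
    ]
    print(count_data_sources(sample))
    print(missing_sources(sample))
```

If `missing_sources` returns a non-empty list, the corresponding dataset is probably labeled differently from what the hardcoded check expects.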

For training Qwen3-14B on multiple nodes, run:

```
cd Lp-Reg
conda activate lp-reg
bash examples/qwen3_14b_kl_minp_8k_wods_32gpu.sh
```

For training Qwen2.5-32B on multiple nodes, run:

```
cd Lp-Reg
conda activate lp-reg
bash examples/qwen2.5_32b_kl_minp_8k_wods_64gpu.sh
```

## Evaluation

For evaluation, we assess model performance across five diverse mathematical reasoning benchmarks: AIME24, AIME25, MATH-500, OlympiadBench, and Minerva Math.

For AIME24 and AIME25, which have smaller test sets, we use sampled decoding with a temperature of 0.6 and generate 16 independent responses per problem. For the remaining benchmarks, including MATH-500, OlympiadBench, and Minerva, we utilize greedy decoding to evaluate performance.
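The decoding settings described above can be summarized as a small lookup. This is an illustrative sketch, not the repository's evaluation code: the function name is hypothetical, and it assumes that the "aime" label corresponds to AIME24 and that greedy decoding is expressed as a temperature of 0 with a single response. The `temperature` and `n` keys mirror the fields of the same name in vLLM's `SamplingParams`.

```python
# Hypothetical mapping from benchmark label to decoding settings.
# Assumptions: "aime" stands for AIME24; greedy decoding is written
# as temperature 0.0 with one response per problem.

SAMPLED_BENCHMARKS = {"aime", "aime25"}


def decoding_config(data_source):
    """Return the decoding settings for a benchmark label."""
    if data_source in SAMPLED_BENCHMARKS:
        # Small test sets: sample 16 responses at temperature 0.6.
        return {"temperature": 0.6, "n": 16}
    # All other benchmarks: greedy decoding, one response.
    return {"temperature": 0.0, "n": 1}


if __name__ == "__main__":
    # Labels other than "aime" and "aime25" are placeholders.
    for name in ["aime", "aime25", "math500"]:
        print(name, decoding_config(name))
```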



