<div align="center">
  <img src="train/assets/page.jpg" alt="Logo" width="500">
</div>



# SPEC-RL: Accelerating On-Policy Reinforcement Learning via Speculative Rollouts

## Installation

Use **either** Docker (fastest) **or** a Conda environment aligned with **verl 0.5.x**.

### Option A — Docker (recommended)

Use the prebuilt image below; it needs no further installation steps:

**`verlai/verl:app-verl0.5-vllm0.9.1-mcore0.12.2-te2.2`**
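
A minimal sketch of starting a container from this image, assuming Docker with the NVIDIA Container Toolkit is installed and the command is run from the repository root. The mount point `/workspace/SPEC-RL`, the `--shm-size` value, and the `start_container` helper are illustrative assumptions, not project requirements.

```bash
IMAGE="verlai/verl:app-verl0.5-vllm0.9.1-mcore0.12.2-te2.2"
WORKDIR="/workspace/SPEC-RL"   # assumed mount point inside the container

# Hypothetical helper: start an interactive shell in the prebuilt image.
start_container() {
  # With DRY_RUN set, the command is printed instead of executed.
  ${DRY_RUN:+echo} docker run --rm -it \
    --gpus all \
    --shm-size=16g \
    -v "$PWD:$WORKDIR" \
    -w "$WORKDIR" \
    "$IMAGE" bash
}

# Preview the command first, then run it for real:
#   DRY_RUN=1 start_container
#   start_container
```

Adjust the shared-memory size and the mount to match your machine and checkout location.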

### Option B — Conda (VERL 0.5.x)

Follow verl's official 0.5.x installation guide to set up the environment (PyTorch, vLLM, etc.):

https://verl.readthedocs.io/en/v0.5.x/start/install.html#install-dependencies
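
After following the guide, a quick import check can confirm that the core packages are visible in the active environment. This is a sketch, assuming the environment is activated and that the packages import as `torch`, `vllm`, and `verl`; `check_pkg` is a hypothetical helper name.

```bash
# Report whether a Python package imports in the active environment.
check_pkg() {
  if python -c "import $1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

for pkg in torch vllm verl; do
  check_pkg "$pkg"
done
```

Any `missing:` line suggests revisiting the corresponding step of the installation guide.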


## Training
### 1) Move into the train directory

```bash
cd train
```

The launch scripts live under `training_scripts/`. Configure their paths as described below, then launch. All training data used in our paper is in the `train/data` directory of the repository.

### 2) Configure the scripts

Each setup, (A) or (B), has **two** scripts to edit before running: a per-experiment script and a main GRPO script.

#### (A) Vanilla GRPO baseline

1. Set **your own project root**, model path, save path, and project name:

   File: `training_scripts/vanilla-grpo/1.7B-grpo.sh`

```bash
# inside training_scripts/vanilla-grpo/1.7B-grpo.sh
PROJECT_DIR=path-to-root-project-dir
MODEL_PATH=path-to-your-base-model-path
SAVE_PATH=path-to-your-save-path
PROJECT_NAME=your-custom-project-name
```

2. Configure the main GRPO script:

   File: `training_scripts/train_grpo.sh`

Set the following variables to **your own paths**:

```bash
# inside training_scripts/train_grpo.sh
MODEL_PATH=path-to-default-base-model-dir
CHECKPOINT_PATH=path-to-default-save-model-path
```
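
The placeholders can be edited by hand. As an alternative, the sketch below rewrites a `NAME=value` assignment in place. `set_var` is a hypothetical helper, not part of the repository; it assumes each variable is assigned at the start of a line, as in the snippets above, and the example path is a placeholder.

```bash
# set_var FILE NAME VALUE: replace the line that starts with NAME= in FILE.
set_var() {
  local file="$1" name="$2" value="$3" tmp
  tmp=$(mktemp) || return 1
  # ENVIRON is used instead of -v so backslashes in VALUE are not reinterpreted.
  if NAME=$name VALUE=$value awk '
       index($0, ENVIRON["NAME"] "=") == 1 {
         print ENVIRON["NAME"] "=\"" ENVIRON["VALUE"] "\""
         next
       }
       { print }
     ' "$file" > "$tmp"; then
    cat "$tmp" > "$file"   # overwrite in place so the script keeps its permissions
  fi
  rm -f "$tmp"
}

# Example (placeholder path):
#   set_var training_scripts/train_grpo.sh MODEL_PATH /data/models/my-base-model
```

The same helper works for the SPEC-RL scripts in (B), since they use the same variable names.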

#### (B) SPEC-RL

1. Set **your own project root** and **SPEC-RL parameters ⚡**:

   File: `training_scripts/spec-rl/1.7B-grpo-lenience-0.5.sh`

```bash
# inside training_scripts/spec-rl/1.7B-grpo-lenience-0.5.sh
PROJECT_DIR=path-to-root-project-dir
MODEL_PATH=path-to-your-base-model-path
SAVE_PATH=path-to-your-save-path
PROJECT_NAME=your-custom-project-name

# enable speculative decoding and set the lenience (passed as --bias)
--spec_decoding True \
--bias 0.5 \
```

2. Configure the main SPEC-RL GRPO script:

   File: `training_scripts/train_grpo-spec-sampling.sh`

Set the following variables to **your own paths**, along with the default **SPEC-RL parameters**:

```bash
# inside training_scripts/train_grpo-spec-sampling.sh
MODEL_PATH=path-to-default-base-model-dir
CHECKPOINT_PATH=path-to-default-save-model-path

SPEC_DECODING=False
BIAS=0.0
```
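
The per-experiment script passes `--spec_decoding True --bias 0.5`, while the main script defaults to `SPEC_DECODING=False` and `BIAS=0.0`. One common way such defaults and command-line overrides interact is sketched below. This is illustrative only: the actual parsing code in `train_grpo-spec-sampling.sh` may differ, and `parse_spec_args` is a hypothetical name.

```bash
# Defaults, as set in the main script.
SPEC_DECODING=False
BIAS=0.0

# Hypothetical parser: flags given on the command line override the defaults.
parse_spec_args() {
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --spec_decoding|--bias)
        if [ "$#" -lt 2 ]; then
          echo "missing value for $1" >&2
          return 1
        fi
        if [ "$1" = "--bias" ]; then BIAS=$2; else SPEC_DECODING=$2; fi
        shift 2 ;;
      *)
        shift ;;   # ignore arguments this sketch does not model
    esac
  done
}

parse_spec_args --spec_decoding True --bias 0.5
echo "spec_decoding=$SPEC_DECODING bias=$BIAS"   # prints: spec_decoding=True bias=0.5
```

Under this pattern, running the main script without the flags would leave speculative decoding off, which is why the defaults are kept at `False` and `0.0`.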

### 3) Log in to Weights & Biases

```bash
wandb login
```
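
For non-interactive jobs (for example inside a container or under a batch scheduler), W&B can also read credentials from the `WANDB_API_KEY` environment variable, and `WANDB_MODE=offline` keeps runs local. The pre-flight sketch below reports which of these applies; `wandb_preflight` is a hypothetical helper name.

```bash
# Report how W&B is expected to authenticate before a long job starts.
wandb_preflight() {
  if [ "${WANDB_MODE:-}" = "offline" ]; then
    echo "offline"      # runs are stored locally and can be synced later
  elif [ -n "${WANDB_API_KEY:-}" ]; then
    echo "api-key"      # key taken from the environment, no prompt needed
  else
    echo "interactive"  # no key in the environment; rely on `wandb login`
  fi
}

wandb_preflight
```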

### 4) Launch training

**Vanilla GRPO baseline** (recommended first run):

```bash
bash training_scripts/vanilla-grpo/1.7B-grpo.sh
```

**SPEC-RL GRPO** (with speculative decoding):

```bash
bash training_scripts/spec-rl/1.7B-grpo-lenience-0.5.sh
```

Once a run has started, monitor its logs under `logs/` (and in your W&B project, if enabled).
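
The log file names are not specified here, so one way to follow the current run is to pick the most recently modified entry in `logs/`. A minimal sketch, assuming file names without newlines; `latest_log` is a hypothetical helper.

```bash
# Print the most recently modified entry in a log directory (default: logs).
latest_log() {
  local dir="${1:-logs}" newest
  # ls -t sorts by modification time, newest first
  newest=$(ls -t "$dir" 2>/dev/null | head -n 1)
  [ -n "$newest" ] && echo "$dir/$newest"
}

# Follow the newest log live:
#   tail -f "$(latest_log logs)"
```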

## Evaluation

Once training is complete, you can evaluate your checkpoints with the scripts under the `eval/` directory.

### 1) Move into the eval directory

```bash
cd eval
```

### 2) Configure eval script

Edit `eval/eval_scripts/example.sh` to set up your environment and paths:

```bash
cd eval

# install dependencies
pip install pebble word2number timeout_decorator jieba matplotlib
cd latex2sympy && pip install -e . && cd ..

# environment variables
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
CKPT_DIR=path-to-your-ckpt-dir
BASE_MODEL=path-to-your-base-model-dir   # used for step-0 evaluation
WANDB_PROJECT=your-wandb-project-name

# launch evaluation
bash eval/eval_scripts/eval_math_nodes.sh \
    --run_name ${CKPT_DIR} \
    --init_model ${BASE_MODEL} \
    --template qwen-boxed \
    --tp_size 1 \
    --add_step_0 true \
    --temperature 1.0 \
    --top_p 0.95 \
    --max_tokens 16000 \
    --benchmarks amc23,aime24,aime25,math500,gsm8k,minerva_math,olympiadbench,mmlu_stem \
    --n_sampling 1 \
    --wandb_project ${WANDB_PROJECT}

```

### 3) Run evaluation

```bash
bash eval/eval_scripts/example.sh
```

After execution, evaluation logs and metrics are saved under `${CKPT_DIR}/eval_results/`. You can inspect the JSON/CSV results directly or visualize them with the plotting scripts.
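
To see which result files a run produced, the sketch below lists every JSON and CSV file under the results directory. The layout inside `eval_results/` is not specified here, so the search is recursive; `list_eval_results` is a hypothetical helper.

```bash
# List JSON and CSV result files under a checkpoint's eval_results directory.
list_eval_results() {
  find "$1/eval_results" -type f \( -name '*.json' -o -name '*.csv' \) | sort
}

# Example:
#   list_eval_results "$CKPT_DIR"
```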
