## Introduction
### Repository Structure
- `data/` contains the train, validation, and test datasets.
- `train/` contains the training code, including the RL training code based on verl and the SFT baseline training code based on trl.
- `evaluation/` contains the evaluation toolkit used for reproducing the evaluation and analysis results.


## Environment Setup

```shell
cd RISE_local
conda create -y -n rise python=3.12.2 && conda activate rise
pip3 install "ray[default]"
pip3 install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn==2.7.4.post1 --no-build-isolation
pip3 install omegaconf==2.4.0.dev3 hydra-core==1.4.0.dev1 antlr4-python3-runtime==4.11.0 vllm==0.7.3
pip3 install "math-verify[antlr4_11_0]==0.7.0" fire deepspeed tensorboardX prettytable datasets
cd train/verl
pip3 install -e .
cd ../../
```
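
After installation, a quick check can catch a missing package before a long training run starts. The sketch below is not part of the repository: `check_pkg` is a hypothetical helper, and the module names in the loop are assumed to be the import names of the packages installed above.

```shell
# Hypothetical helper (not part of the repository): report whether a Python
# module can be located in the active environment, without importing it.
check_pkg() {
  if python3 -c "import importlib.util, sys; sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)" "$1" 2>/dev/null; then
    echo "ok: $1"
  else
    echo "missing: $1"
  fi
}

# Module names are assumptions based on the packages installed above.
for pkg in torch vllm flash_attn ray math_verify verl; do
  check_pkg "$pkg"
done
```

Any line starting with `missing:` points to a package that likely needs to be reinstalled in the `rise` environment.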

## RL Training

* Data Preparation

  ```shell
  DATA_DIR=data/train
  # check the input data path set in generate_splits.py
  python3 train/verl_utils/data/generate_splits.py --local_dir $DATA_DIR
  ```

* Start Ray

  ```shell
  # Use the commands below if you are running across multiple machines
  # Head node (×1)
  ray start --head --port=6379 --node-ip-address=$HEAD_ADDR --num-gpus=8
  # Worker nodes (×N)
  ray start --address=$HEAD_ADDR:6379 --node-ip-address=$WORKER_ADDR --num-gpus=8

  # Use the command below if you are running on one machine
  ray start --head --num-cpus=8 --dashboard-port=8265 --dashboard-host=0.0.0.0
  ```
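
  In the multi-node setup, the `ray start` commands rely on `HEAD_ADDR` and `WORKER_ADDR` being set on the machine where they run. A minimal guard, assuming a POSIX shell; `require_vars` is a hypothetical helper and not part of the repository:

  ```shell
  # Hypothetical helper: return non-zero if any named variable is unset or empty.
  require_vars() {
    for name in "$@"; do
      eval "value=\${$name:-}"
      if [ -z "$value" ]; then
        echo "error: $name is not set" >&2
        return 1
      fi
    done
  }
  ```

  Call it with whichever variables the command on that machine uses, for example `require_vars HEAD_ADDR && ray start ...`. Once all nodes have joined, `ray status` should list the resources of every machine.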

* Launch training on the head node (see `train/verl_scripts` for the complete set of training scripts)
  ```shell
  # Qwen2.5-3B example

  # Set the input/output directories and the wandb/Hugging Face tokens in the script file first
  cd train/verl_scripts
  sh start_qwen3b_rise.sh # RISE
  sh start_qwen3b_zero_rl.sh # Zero-RL
  ```

* Merge the model files after training
  ```shell
  cd train/verl_utils
  # Set the trained model path in run_ckpt_merge.sh
  sh run_ckpt_merge.sh
  ```

## SFT Baseline Training
```shell
cd train/sft_baseline/scripts

# 1. Set the base model path in run_training_multiple_qwen.sh
# 2. Check that the training dataset path is set correctly in run_training_multiple_qwen.sh
# 3. Set the output model path in ../args/Qwen2.5-1.5B-SFT.json, ../args/Qwen2.5-3B-SFT.json, ../args/Qwen2.5-7B-SFT.json

sh run_training_multiple_qwen.sh
```
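
Because step 3 involves hand-editing JSON files, it can be worth confirming that they still parse before launching training. A minimal sketch using Python's built-in `json.tool` module; `check_json` is a hypothetical helper, not part of the repository, and it only checks syntax, not whether the paths inside the files are correct:

```shell
# Hypothetical helper: report whether a file parses as JSON.
check_json() {
  if python3 -m json.tool "$1" > /dev/null 2>&1; then
    echo "valid: $1"
  else
    echo "invalid: $1"
  fi
}

# Paths are relative to train/sft_baseline/scripts, as in the steps above.
for f in ../args/Qwen2.5-1.5B-SFT.json ../args/Qwen2.5-3B-SFT.json ../args/Qwen2.5-7B-SFT.json; do
  check_json "$f"
done
```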

## Evaluation
```shell
cd evaluation/scripts

# 1. Set the output models directory (model_path) and target model names (models) in run_eval_auto.sh
# 2. Adjust the other settings if you are not using the default configuration

sh run_eval_auto.sh # outputs the evaluation metrics for each benchmark reported in the paper
```

## Analysis
* **Comparison to off-the-shelf verifiers (math-shepherd)**

  ```shell
  cd evaluation/scripts
  # 1. Set the path to math-shepherd in run_verify_rm.sh
  # 2. Set the target generation files to verify in run_verify_rm.sh (GEN_DIRS)
  sh run_verify_rm.sh # outputs the verification accuracy
  ```

* **Impact of Verification Compute**
  
  To reproduce the results for increased train-time verification compute, adjust the `+data.critique_batch_size` value in the verl training scripts. For example:
  ```shell
  # +data.critique_batch_size: 256 (for 25%), 512 (for 50%), or 1024 (for 100%)
  python3 -m verl.trainer.main_ppo \
  ...
  +data.critique_batch_size=256 \
  ...
  +trainer.online_critique=True \
  ...
  ```
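
  The three values above correspond to 25%, 50%, and 100% of 1024. Assuming other ratios scale the same way (an assumption, not something the scripts state), the value for a given percentage can be computed as sketched here; `FULL_BATCH` and `critique_bs_for` are hypothetical names:

  ```shell
  FULL_BATCH=1024   # assumed from the 100% setting above

  # Hypothetical helper: critique batch size for a given integer percentage.
  critique_bs_for() {
    echo $(( FULL_BATCH * $1 / 100 ))
  }

  critique_bs_for 25    # prints 256
  critique_bs_for 50    # prints 512
  ```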

* **Online and Offline Verification**
  1. Use the trained Zero-RL model (ckpt 96) to run inference on the MATH-Hard dataset (follow the instructions in Evaluation).
  2. Locate the resulting generation file and build the offline dataset:
     ```shell
     cd train/verl_utils/data

     # Usage: python3 reproduce_offline_dataset.py /path/to/input_file.jsonl /path/to/output/folder
     # Set the model_size in reproduce_offline_dataset.py accordingly ("1.5B", "3B", "7B")
     python3 reproduce_offline_dataset.py /path/to/generation.jsonl ../../../data/train
     ```
  3. Follow the instructions in RL Training to train the offline RISE models, with `+trainer.online_critique=False` set in the script file.
* **Enhanced Verification for Reasoning**
  
  Follow the instructions in `evaluation/analysis/reflection.ipynb` to reproduce the analysis results.
