## Quick Start


## README added by wangpy
Running experiments: 
```bash
bash eval_math_nodes.sh \
    --run_name Qwen2.5-7B_AIME24_temp0.6_n256_seed1_num_hint-level0_source-r1 \
    --init_model Qwen2.5-7B \
    --template qwen-boxed \
    --tp_size 1 \
    --add_step_0 false \
    --temperature 0.6 \
    --top_p 0.95 \
    --max_tokens 16000 \
    --benchmarks aime24 \
    --n_sampling 256 \
    --just_wandb false \
    --num_shots 0 \
    --shot_source r1 \
    --seed 1
```
Note that `--run_name` and `--init_model` in the command above have no effect: both are overridden in lines 216-249 of `eval_math_nodes.sh`.

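The shell scripts are nested three levels deep. You only need to modify the outermost script, `/mnt/afs/codes/wangpy/limit-of-RLVR/math/eval_math_nodes.sh`, and the third-level script, `/mnt/afs/codes/wangpy/limit-of-RLVR/math/examples/math_eval/sh/eval_multi_gpu.sh`.

In the outermost script, edit lines 216-249:
- `init_model_path`: path of the model to load
- `base_checkpoint_path`: path where the code writes its logs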

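In the third-level script:
- `num_samples`: number of samples, used for slicing
- `num_gpus`: number of GPUs

The 72B model requires one additional parameter change: `pipeline_parallel_size`.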
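To make the role of `num_samples` and `num_gpus` concrete, here is a minimal sketch of one common way to slice a dataset across GPUs. It assumes contiguous, near-equal slices; the function name `slice_bounds` is hypothetical, and the actual logic in `eval_multi_gpu.sh` may differ.

```python
from typing import List, Tuple


def slice_bounds(num_samples: int, num_gpus: int) -> List[Tuple[int, int]]:
    """Split range(num_samples) into num_gpus contiguous [start, end) slices.

    Assumption: slices are as equal as possible, with the first
    (num_samples % num_gpus) slices receiving one extra sample.
    """
    if num_samples < 0 or num_gpus <= 0:
        raise ValueError("num_samples must be >= 0 and num_gpus must be > 0")
    base, extra = divmod(num_samples, num_gpus)
    bounds = []
    start = 0
    for rank in range(num_gpus):
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


if __name__ == "__main__":
    # Example: 30 problems spread over 8 GPUs.
    for rank, (start, end) in enumerate(slice_bounds(30, 8)):
        print(f"gpu {rank}: samples [{start}, {end})")
```

If `num_samples` does not match the real dataset size, a scheme like this would either skip samples or produce empty slices, so it is worth keeping the two values in sync with the chosen benchmark.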

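Currently supported datasets:
`aime24`, `aime24EX7` (the 7 problems that Qwen2.5-7B cannot answer, with hints added), `aime24EX7_withouthint` (the same 7 problems in their original form, without hints), and any other dataset under `examples/math_eval/data`.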



### Installation

```bash
conda create -n verl python==3.9
conda activate verl
pip3 install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu124
pip3 install flash-attn --no-build-isolation
pip3 install -e . 
```

### Generation & Evaluation

For *base* models:
```bash
bash eval_math_nodes.sh \
    --run_name Qwen2.5-7B_minerva_math_temp0.6_n32_seed1 \
    --init_model Qwen2.5-7B \
    --template qwen-boxed  \
    --tp_size 1 \
    --add_step_0 true  \
    --temperature 0.6 \
    --top_p 0.95 \
    --max_tokens 16000 \
    --benchmarks minerva_math \
    --n_sampling 32 \
    --just_wandb false \
    --seed 1
```

For *SimpleRL-Zoo* models:
```bash
bash eval_math_nodes.sh \
    --run_name Qwen-2.5-7B-SimpleRL-Zoo_minerva_math_temp0.6_n32_seed1 \
    --init_model Qwen-2.5-7B-SimpleRL-Zoo \
    --template qwen-boxed  \
    --tp_size 1 \
    --add_step_0 true  \
    --temperature 0.6 \
    --top_p 0.95 \
    --max_tokens 16000 \
    --benchmarks minerva_math \
    --n_sampling 32 \
    --just_wandb false \
    --seed 1
```

For *Oat-Zero* models:
```bash
bash eval_math_nodes.sh \
    --run_name Qwen2.5-Math-7B-Oat-Zero_minerva_math_temp0.6_n32_seed1 \
    --init_model Qwen2.5-Math-7B-Oat-Zero \
    --template qwen-boxed  \
    --tp_size 1 \
    --add_step_0 true  \
    --temperature 0.6 \
    --top_p 0.95 \
    --max_tokens 16000 \
    --benchmarks minerva_math \
    --n_sampling 32 \
    --just_wandb false \
    --seed 1
```


For *DAPO* models:
```bash
bash eval_math_nodes.sh \
    --run_name DAPO-Qwen-32B_minerva_math_temp0.6_n32_seed1 \
    --init_model DAPO-Qwen-32B \
    --template abel  \
    --tp_size 4 \
    --add_step_0 true  \
    --temperature 0.6 \
    --top_p 0.95 \
    --max_tokens 16000 \
    --benchmarks minerva_math \
    --n_sampling 32 \
    --just_wandb false \
    --seed 1
```

#### Prompts

We use the template `qwen-boxed` for the *SimpleRL-Zoo* and *Oat-Zero* series and their corresponding *base* models. The original *Oat-Zero* paper used slightly different prompts from those in *SimpleRL-Zoo*, but our tests showed similar performance, so for simplicity we adopted the *SimpleRL-Zoo* prompts directly.

We use the template `abel` for the *DAPO* series and its corresponding *base* models. We used the older prompts and model weights of *DAPO* as provided in [DAPO](https://huggingface.co/BytedTsinghua-SIA/DAPO-Qwen-32B/blob/38b8075427d3f7d12075377d1e40495875066189/inference/example.json). Both the prompts and the model weights have since been updated upstream.

All templates are defined in `examples/math_eval/utils.py`; you may change them if needed.

#### *pass@k*

After the script finishes, the evaluation results are saved in `examples/math_eval/EVAL/checkpoints/$RUN_NAME/eval_results`, with the metrics in `$RUN_NAME/eval_results/eval_results.csv`.

**`pass@k.py` provides a function for computing *pass@k* from these results, together with an example that prints the values. You can follow or adapt that example.**
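For reference, *pass@k* is commonly computed with the unbiased estimator `1 - C(n - c, k) / C(n, k)`, where `n` is the number of samples per problem and `c` the number of correct ones. The sketch below illustrates that estimator; the function names are hypothetical, and it is not necessarily how `pass@k.py` is implemented.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: correct samples among them, k: budget (1 <= k <= n).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # in which case every size-k subset contains a correct answer.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: Sequence[int], n: int, k: int) -> float:
    """Average pass@k over problems, given per-problem correct counts."""
    if not correct_counts:
        raise ValueError("correct_counts must not be empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Hypothetical counts: three problems, 32 samples each.
    counts = [0, 4, 32]
    for k in (1, 8, 32):
        print(f"pass@{k} = {mean_pass_at_k(counts, n=32, k=k):.4f}")
```

Because the estimator needs `k <= n`, the largest reportable `k` is bounded by the value passed to `--n_sampling`.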

#### Sampling Details

**Important:** because we sample from the same model multiple times on the same dataset, the responses must differ both across runs and across samplings within a single run. Both are handled by the top-level interface, so you only need to pass the parameters described below.

To make responses from different runs distinct, simply set the random seed as follows:

```bash
bash eval_math_nodes.sh \
    --seed 1  # use a different seed for each run
```
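When launching several runs, one way to keep the seed and the run name in sync is to generate the commands programmatically. The helper below is a hypothetical sketch, not part of the repository: it uses only flags that appear in the examples in this README and assumes the run-name pattern `<model>_<benchmark>_temp<T>_n<N>_seed<S>` seen there.

```python
import shlex
from typing import List


def build_eval_command(
    model: str,
    benchmark: str,
    seed: int,
    temperature: float = 0.6,
    n_sampling: int = 32,
    template: str = "qwen-boxed",
    tp_size: int = 1,
) -> List[str]:
    """Build the argument list for one run of eval_math_nodes.sh."""
    # Assumed naming pattern, copied from the README examples.
    run_name = f"{model}_{benchmark}_temp{temperature}_n{n_sampling}_seed{seed}"
    return [
        "bash", "eval_math_nodes.sh",
        "--run_name", run_name,
        "--init_model", model,
        "--template", template,
        "--tp_size", str(tp_size),
        "--add_step_0", "true",
        "--temperature", str(temperature),
        "--top_p", "0.95",
        "--max_tokens", "16000",
        "--benchmarks", benchmark,
        "--n_sampling", str(n_sampling),
        "--just_wandb", "false",
        "--seed", str(seed),
    ]


if __name__ == "__main__":
    # Print one command per seed; each could be run from a shell or
    # passed to subprocess.run(cmd, check=True).
    for seed in (1, 2, 3):
        print(shlex.join(build_eval_command("Qwen2.5-7B", "minerva_math", seed)))
```

Embedding the seed in the run name keeps the output directories of different runs separate, so results from distinct seeds can later be pooled without overwriting one another.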

To make the samplings within a single run differ, pass the number of samplings per run as follows; no other action is needed:

```bash
bash eval_math_nodes.sh \
    --n_sampling 32
```

### Applicability

The framework is applicable to the *SimpleRL-Zoo*, *Oat-Zero*, and *DAPO* series and their corresponding *base* models.
