
# Codebase and Anonymity Notice

This submission is based on a specific version of the public **VERL (Volcano Engine Reinforcement Learning)** library. For full transparency and to allow for the precise verification of our modifications, the provided codebase is derived from the VERL repository at **commit hash `38d9a88170786a45cb189a08290c4651e6d6f671`**. For the convenience of reviewers, we have included a pruned version of the library, removing files irrelevant to the reproduction of our results.

**Important:** The included base library code contains original, unaltered identifying information as it existed at the specified commit hash. This includes copyright notices, internal developer comments (e.g., `TODO` notes), and metadata in files such as `setup.py` which may contain **usernames, author names, emails, or external hyperlinks**.

**These have been intentionally preserved in their entirety** to respect the original license terms and attribution. Any such identifiers refer exclusively to the original developers of the VERL library, **NOT** the authors of this submission.

Our own contributions (detailed below) have been carefully anonymized.

# Summary of Modifications

Our novel contributions are implemented in the following ways:

1.  **New files written by us:** All newly created files are located in the `recipe/osft` directory, unless specified otherwise.

2.  **Modifications to existing VERL files:** To integrate our method, we have modified several existing files.

3.  **New files adapted from other open-source projects:** We have included functionalities adapted from existing open-source code and have preserved their original licenses as required.

Below is a precise breakdown of our changes by file and line range. All paths are relative to the `osft` base directory.


## 1. New Files Written by Us

### Summary of New Files

Inside the `recipe/osft/` directory, we have the following new files:

- `main_osft.py`: The main training script for our method, adapted from `main_ppo.py` in VERL. This script orchestrates the training process using our proposed techniques.

- `dp_actor.py`: Custom actor class for our method, adapted from `verl/workers/actor/dp_actor.py` in VERL. This is the core code that computes the OSFT loss.

- `fsdp_workers.py`: Custom FSDP worker class for our method, adapted from `verl/workers/fsdp_workers.py` in VERL. We only modified the `init_model` function so that it loads the OSFT actor.

- `osft_trainer.py`: Custom trainer class for our method, adapted from `verl/trainer/ppo/ray_trainer.py` in VERL. This file drives the whole OSFT training process: the training loop in the `fit` function, validation in the `_validate` function, and checkpoint validation in the `_validate_base` function.

- `generation_same_validate.py`: Implements the validation logic for checkpoints of all kinds of models (the base model, or models trained with OSFT, GRPO, DAPO, or Dr. GRPO) on the benchmark dataset. This code is adapted from `main_osft.py`.

- `config/osft_trainer.yaml`: Configuration file for our training setup, which specifies default hyperparameters and settings specific to our method.

In the `examples/` directory, we have the following new files:

- `osft_1e7_dsr_tau0s6.sh`: Example bash script to run training with specific hyperparameters and $\tau_s=0.6$ for OSFT.
- `grpo_1e7_dsr_tau1s.sh`: Example bash script to run training with specific hyperparameters and $\tau_s=1$ for GRPO.
- `drgrpo_1e7_dsr_tau1s`: Same as above, but for Dr. GRPO.
- `dapo_1e7_dsr_tau1s`: Same as above, but for DAPO.
- `all_models_tauv.sh`: Evaluates all base models at different evaluation temperatures $\tau_{eval}$.

## 2. Modifications to Existing VERL Files

### Modified Files Summary

- `verl/trainer/ppo/ray_trainer.py`: (whole function) modified the `_validate` function to log pass@1 and pass@8 scores and perplexity for 16 prompt-response pairs from the benchmark.
- `verl/trainer/ppo/metric_utils.py`: (lines 430-527) added `process_benchmark_metrics` to compute pass@1 and pass@8 (named `avg_pass@k` in the code).
- `recipe/dapo/dapo_ray_trainer.py`: (whole function) modified the `_validate` function to log pass@1 and pass@8 scores and perplexity for 16 prompt-response pairs from the benchmark.
- `verl/utils/reward_score/__init__.py`: (lines 39-59) modified the `default_compute_score` function so that the additional datasets (benchmark and training sets) call the verifier.
- `verl/workers/actor/dp_actor.py`: (lines 455-741) added `_forward_micro_batch_logits` to compute the logits, along with the logic for computing other metrics on the benchmark subset.
- `verl/workers/fsdp_workers.py`: (lines 742-792) added `compute_logits_norm_n_entropy` to aggregate the logits and other metrics for the benchmark subset.
- `verl/workers/reward_manager/dapo.py`: (lines 96-129) modified the `DAPORewardManager` class to handle the reward calculation for DAPO training.
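
For readers unfamiliar with the pass@k metric mentioned above, the sketch below shows the commonly used unbiased estimator, $1 - \binom{n-c}{k} / \binom{n}{k}$ for $n$ sampled responses of which $c$ are correct, averaged over prompts. It illustrates the general technique only and is not a copy of `process_benchmark_metrics`; the function names `pass_at_k` and `avg_pass_at_k` are hypothetical, and the actual implementation may differ.

```python
from math import comb
from typing import Sequence


def pass_at_k(num_samples: int, num_correct: int, k: int) -> float:
    """Unbiased pass@k estimate for one prompt (hypothetical helper)."""
    if not 0 <= num_correct <= num_samples:
        raise ValueError("num_correct must be between 0 and num_samples")
    if not 1 <= k <= num_samples:
        raise ValueError("k must be between 1 and num_samples")
    # Probability that k responses drawn without replacement are all incorrect.
    all_wrong = comb(num_samples - num_correct, k) / comb(num_samples, k)
    return 1.0 - all_wrong


def avg_pass_at_k(correct_flags: Sequence[Sequence[bool]], k: int) -> float:
    """Average the per-prompt pass@k estimate over a set of prompts."""
    if not correct_flags:
        raise ValueError("correct_flags must contain at least one prompt")
    scores = [pass_at_k(len(flags), sum(flags), k) for flags in correct_flags]
    return sum(scores) / len(scores)
```

Each inner sequence in `correct_flags` holds the verifier outcomes for one prompt, so `avg_pass_at_k(flags, 1)` and `avg_pass_at_k(flags, 8)` would correspond to pass@1 and pass@8 under this assumption.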


## 3. Included Third-Party Modules

### Verifier for Evaluation
-   Directory: `verl/utils/reward_score/entropy_math/`
-   Origin: As stated in **Appendix A.3 Verifier** of our main paper, these files are the official implementation of the math verifier from **Cui et al. (2025)**.
-   Note on Modifications: We have made minor, non-conceptual modifications to lines 1064-1089 of the original file `__init__.py` to integrate it with our codebase. These changes do not alter the core functionality of the verifier.
-   Reason for Inclusion: The module is included directly to ensure full reproducibility of our results.
-   Anonymity Note: All original copyright notices and licenses within these files have been preserved. **They refer to the authors of that work, NOT the authors of this submission.**


## 4. Datasets

The processed datasets used in our experiments are included in the `data/` directory. These datasets are derived from publicly available sources, as detailed in our main paper.

### Locations of Datasets

Inside the `data/` directory, we have the following datasets:

Training sets:

- DeepScaleR: `deepscaler/train.parquet`
- OpenThoughts (math-only): `OpenThoughts3-1.2M/math_question/train.parquet`

Validation sets (the benchmark):

- Math500: `benchmarks/math500.parquet`
- AMC: `benchmarks/amc.parquet`
- OlympiadBench (Olympiad): `benchmarks/olympiadbench.parquet`
- Minerva Math (Minerva): `benchmarks/minerva.parquet`
- AIME24: `benchmarks/aime.parquet`
- AIME25: `benchmarks/aime25.parquet`
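
Before launching a run, it can help to confirm that every file listed above is present. The following is a minimal sketch using only the standard library; the helper name `missing_datasets` is hypothetical and not part of the codebase, and the default `data` directory assumes the script is run from the `osft` base directory.

```python
from pathlib import Path

# Relative paths copied from the dataset lists above.
EXPECTED_FILES = [
    "deepscaler/train.parquet",
    "OpenThoughts3-1.2M/math_question/train.parquet",
    "benchmarks/math500.parquet",
    "benchmarks/amc.parquet",
    "benchmarks/olympiadbench.parquet",
    "benchmarks/minerva.parquet",
    "benchmarks/aime.parquet",
    "benchmarks/aime25.parquet",
]


def missing_datasets(data_dir: str = "data") -> list[str]:
    """Return the expected dataset files that are absent under data_dir."""
    root = Path(data_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]
```

Calling `missing_datasets()` returns an empty list when all training and validation files are in place, and otherwise lists the relative paths that could not be found.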

# How to Run

## Environment Setup

Make sure you are in the `osft` base directory, then run:

```
conda create -n osft python==3.10
conda activate osft

cd verl
pip install torch==2.6.0
pip install -e .
pip install flash-attn --no-build-isolation
```

If the installation fails, refer to `osft_envs/environment.yml`, the conda environment exported from our working setup.
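
As a quick sanity check after installation, the sketch below compares the active interpreter and the installed `torch` package against the versions used in the commands above (Python 3.10 and torch 2.6.0). The helper name `environment_mismatches` is hypothetical and not part of the codebase; it only reads version metadata and does not import `torch`.

```python
import sys
from importlib import metadata


def environment_mismatches(expected_python=(3, 10), expected_torch="2.6.0"):
    """Return human-readable descriptions of version mismatches (empty if none)."""
    problems = []
    found_python = tuple(sys.version_info[:2])
    if found_python != tuple(expected_python):
        problems.append(
            f"Python {found_python[0]}.{found_python[1]} found, "
            f"expected {expected_python[0]}.{expected_python[1]}"
        )
    try:
        found_torch = metadata.version("torch")
    except metadata.PackageNotFoundError:
        problems.append("torch is not installed")
    else:
        # Some builds carry a local suffix such as "+cu124"; compare the public part.
        if found_torch.split("+")[0] != expected_torch:
            problems.append(f"torch {found_torch} found, expected {expected_torch}")
    return problems
```

An empty list from `environment_mismatches()` suggests the two pinned versions match; it does not verify the remaining dependencies, for which the exported `environment.yml` is the reference.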

## Training or Evaluating

Make sure you are in the `osft` directory, then run one of the bash scripts in the `examples/` directory, for example:

```
conda activate osft

# training
bash examples/osft_1e7_dsr_tau0s6.sh

# evaluation
bash examples/all_models_tauv.sh
```
