

This repository is the official implementation of TimeHC-RL (Distilabel (Data Generation) + TRL (SFT) + VeRL (GRPO)).

**Requirements**

Environment for RL (VeRL Framework):

Python 3.10.14

```cmd
pip install torch==2.4.0 --index-url https://download.pytorch.org/whl/cu121
pip3 install vllm==0.6.3 ray
pip3 install flash-attn --no-build-isolation
pip install -e .  # For verl integration
pip install wandb IPython matplotlib
pip install torchdata
pip install modelscope
wandb init
```



Environment for SFT (The TRL framework used in Open R1)

CUDA 12.4 nvcc--version

```
pip install vllm==0.8.4
pip install setuptools
pip install flash-attn --no-build-isolation
pip install -e ".[dev]"
git-lfs --version
```



**Data Preprocessing and generation for training**

1. Data format conversion and split

2. Long-thought SFT data generation

3. Direct SFT data generation

4. RL data generation

We provide code for data generation tailored to different training methods, along with specific Parquet format data files available for inspection under the `rl/data` directory. Additionally, preliminary data processing implementations are provided in the `data format conversion and split` directory.



**Training**

SFT:

```cmd
accelerate launch --main_process_port=29502 --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path .../Qwen-7B \
    --dataset_name .../SFTData/direct_test \
    --learning_rate 5.0e-5 \
    --num_train_epochs 3 \
    --max_seq_length 16384 \
    --per_device_train_batch_size 1 \
    --gradient_checkpointing \
    --bf16 \
    --output_dir data/Qwen2.5-7B-direct
```



RL:

```shell
set -x

#Qwen2.5-7B-Instruct-1M
CHECKPOINT_PATH=.../Qwen2.5-7B-Instruct-1M

export VLLM_ATTENTION_BACKEND=XFORMERS

cd .../verl/
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_files=.../train.parquet \
    data.val_files=.../test.parquet \
    data.train_batch_size=8 \
    data.max_prompt_length=1536 \
    data.max_response_length=2048 \
    actor_rollout_ref.model.path=$CHECKPOINT_PATH \
    actor_rollout_ref.actor.optim.lr=3e-7 \
    actor_rollout_ref.model.use_remove_padding=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
    actor_rollout_ref.actor.use_dynamic_bsz=True \
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=16384 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl \
    actor_rollout_ref.model.enable_gradient_checkpointing=True \
    actor_rollout_ref.actor.fsdp_config.param_offload=False \
    actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6 \
    actor_rollout_ref.rollout.temperature=1.0 \
    actor_rollout_ref.rollout.n=8 \
    actor_rollout_ref.rollout.enforce_eager=True \
    actor_rollout_ref.rollout.free_cache_engine=False \
    actor_rollout_ref.ref.fsdp_config.param_offload=True \
    algorithm.kl_ctrl.kl_coef=0.001 \
    trainer.critic_warmup=0 \
    trainer.logger=['console','wandb'] \
    trainer.project_name='' \
    trainer.experiment_name='' \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=1 \
    trainer.save_freq=500 \
    trainer.test_freq=25 \
    trainer.total_epochs=3 $@ 2>&1 | tee qwen2.5_1M_old_parameter_adaptive_long.log
```

We conduct reinforcement learning based on the VERL framework. First, we generate RL training data using `preprocess.py` under the `rl/data` directory. Then, we introduce debugging for the Ray distributed framework in the `main_ppo.py` file. The `ray_trainer.py` file contains a relatively complete training pipeline. We modify the reward function implementation in `rl/verl/utils/reward_score`, and `rl/verl/workers/fsdp_workers.py` includes more detailed implementations such as the `update_actor` method.

**Note:** The Open R1 and VeRL frameworks contain numerous files and complex dependencies. In the supplementary material, we have uploaded the specific files we modified that reflect the core of our proposed method. To run the code successfully, it is necessary to download the remaining dependency files from the official frameworks (which we did not modify) and place them in the correct directories.



**Test-Time Scaling**

We provide implementations for both parallel scaling(`test-time scaling/majority.py`) and sequential scaling(`test-time scaling/vllm_inference_budget_tomi.py`), which can be run directly.



**Model Evaluation**

We provide corresponding evaluation implementations for models called through APIs, such as OpenAI-O3 and DeepSeek-R1(`model evaluation/model_evaluation_api.py`), as well as for local models(`model evaluation/model_evaluation_local.py`). The evaluation code can be run directly.