<div align="center">

# RoRecomp

<div>
🚀 Enhancing reasoning efficiency via Rollout Response Recomposition 🌟
</div>
</div>


## Overview

This repository builds on two open-source projects: [DeepScaleR](https://github.com/agentica-project/rllm) and [verl](https://github.com/volcengine/verl). `DeepScaleR` provides the training data, and `verl` provides the reinforcement learning training framework.

## Training Data
Download the DeepScaleR-Preview-Dataset from [Hugging Face](https://huggingface.co/datasets/agentica-org/DeepScaleR-Preview-Dataset) and move it to the `/data` directory.
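
As one possible way to script the download step, the sketch below uses `snapshot_download` from the `huggingface_hub` package. The helper name `download_dataset` is hypothetical and not part of this repository, and the sketch only fetches the files as published on the Hub; it does not cover any preprocessing the training script may expect.

```python
from pathlib import Path

# Dataset repository named in the link above.
REPO_ID = "agentica-org/DeepScaleR-Preview-Dataset"


def download_dataset(target_dir: str = "/data") -> Path:
    """Download the dataset files into target_dir (hypothetical helper).

    Requires `pip install huggingface_hub` and network access.
    """
    # Imported lazily so this module loads even without huggingface_hub.
    from huggingface_hub import snapshot_download

    local_path = snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",   # the repo is a dataset, not a model
        local_dir=target_dir,  # place files directly in the target directory
    )
    return Path(local_path)
```

Calling `download_dataset()` writes to `/data`, which may require elevated permissions; pass another path to `target_dir` if needed.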

## Installation

### Install verl

Please refer to the [verl documentation](https://verl.readthedocs.io/en/latest/start/install.html) for installation guidance.

Package versions used in our experiments:
```
python==3.10
pytorch==2.6.0
vllm==0.8.5 or 0.6.3
transformers==4.51.3
```
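
To compare a local environment against the versions listed above, a small check along these lines may help. It is a sketch, not part of this repository: the helper `check_versions` is hypothetical, and it assumes the pip distribution name for PyTorch is `torch`.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions listed above; "torch" is the pip distribution name for PyTorch.
EXPECTED = {
    "torch": ("2.6.0",),
    "vllm": ("0.8.5", "0.6.3"),
    "transformers": ("4.51.3",),
}


def check_versions(expected):
    """Return a {package: status} report for the given expected versions."""
    report = {}
    for name, allowed in expected.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            report[name] = "not installed"
            continue
        # Ignore local build tags such as "+cu124" when comparing.
        base = installed.split("+")[0]
        report[name] = "ok" if base in allowed else f"mismatch (found {installed})"
    return report


for package, status in check_versions(EXPECTED).items():
    print(f"{package}: {status}")
```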

### Install RoRecomp
```
pip install -e .
```

## Training Script

We provide a script to train RoRecomp on DeepSeek-R1-Distill-Qwen-1.5B.
```
# start ray
ray start --head 

bash scripts/train/run_deepseek_1.5b_8k.sh
```

## Explanation
RoRecomp requires only a small amount of code. The key part is in the `fit` function of [ray_trainer.py](verl/verl/trainer/ppo/ray_trainer.py), where we recompose the experiences (i.e., the model's responses with their advantages) into priority and compensation batches.
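
The recomposition idea can be illustrated with a toy sketch. Everything here is an assumption for illustration rather than the code in `ray_trainer.py`: the `Experience` type, the `recompose` helper, the `keep_ratio` parameter, and in particular the selection rule (shortest correct and longest incorrect responses go to the priority batch, the rest to the compensation batch).

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Experience:
    """Hypothetical container for one sampled response."""
    response_len: int  # number of tokens in the response
    is_correct: bool   # whether the final answer was judged correct
    advantage: float   # advantage estimate attached to the response


def recompose(
    experiences: List[Experience], keep_ratio: float = 0.5
) -> Tuple[List[Experience], List[Experience]]:
    """Split experiences into (priority, compensation) batches.

    Assumed rule: the shortest correct and the longest incorrect
    responses are treated as priority; everything else is compensation.
    """
    correct = sorted(
        (e for e in experiences if e.is_correct),
        key=lambda e: e.response_len,
    )
    incorrect = sorted(
        (e for e in experiences if not e.is_correct),
        key=lambda e: e.response_len,
        reverse=True,
    )
    n_correct = math.ceil(len(correct) * keep_ratio)
    n_incorrect = math.ceil(len(incorrect) * keep_ratio)
    priority = correct[:n_correct] + incorrect[:n_incorrect]
    compensation = correct[n_correct:] + incorrect[n_incorrect:]
    return priority, compensation


batch = [
    Experience(300, True, 0.5),
    Experience(100, True, 0.5),
    Experience(200, False, -0.5),
    Experience(400, False, -0.5),
]
priority, compensation = recompose(batch)
```

With this toy batch, the priority batch holds the 100-token correct response and the 400-token incorrect one, and the compensation batch holds the other two. Refer to the `fit` function for the actual criteria and for how each batch is used during training.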

## Evaluation
We use [Qwen-Math](https://github.com/QwenLM/Qwen2.5-Math) to evaluate performance on mathematical benchmarks, and [lighteval](https://github.com/huggingface/lighteval) to evaluate performance on GPQA and LiveCodeBench. 