# Better, Faster: Harnessing Self-Improvement in Large Reasoning Models

This repository contains the code for our paper submitted to ICLR 2026.

## Requirements and Installation

- PyTorch version >= 2.4.0
- Python version >= 3.10
- To install **LLaMA-Factory** and develop locally:

``` bash
git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics,vllm]" --no-build-isolation
```
- To install **open-r1** and develop locally:

``` bash
git clone https://github.com/huggingface/open-r1.git
cd open-r1
pip install -e ".[dev]"
```
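
Before installing the two packages, it can help to confirm that the environment meets the minimum versions listed above. The following is a minimal sketch and not part of this repository: `meets_minimum` and `check_environment` are hypothetical helpers, and only the leading numeric part of a version string is compared (so a local suffix such as `+cu121` is ignored).

```python
import re
import sys
from importlib import metadata


def meets_minimum(version: str, minimum: str) -> bool:
    """Compare dotted version strings numerically, ignoring any suffix."""

    def parse(v: str) -> tuple:
        # Keep only the leading numeric release part, e.g. "2.4.0+cu121" -> (2, 4, 0).
        match = re.match(r"\d+(?:\.\d+)*", v)
        if match is None:
            raise ValueError(f"unrecognized version string: {v!r}")
        return tuple(int(part) for part in match.group(0).split("."))

    a, b = parse(version), parse(minimum)
    # Pad the shorter tuple with zeros so "3.10" and "3.10.0" compare equal.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def check_environment() -> dict:
    """Report whether Python and PyTorch satisfy the minimums from this README."""
    python_version = ".".join(str(n) for n in sys.version_info[:3])
    report = {"python": meets_minimum(python_version, "3.10")}
    try:
        report["torch"] = meets_minimum(metadata.version("torch"), "2.4.0")
    except metadata.PackageNotFoundError:
        report["torch"] = None  # PyTorch is not installed in this environment
    return report


if __name__ == "__main__":
    print(check_environment())
```

A value of `None` for `torch` means that PyTorch was not found, not that the version is too old.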

## File Description

The file organization of this project is as follows:
```
HSIR
├── data
│   ├── GSM8K
│   ├── MedQA
│   └── testset
├── H-GRPO
│   ├── open_r1
│   ├── recipes
│   ├── scripts
│   └── trl
├── script
│   ├── HSIR-DPO_GSM8K.sh
│   ├── HSIR-DPO_MedQA.sh
│   ├── HSIR-SFT_GSM8K.sh
│   └── HSIR-SFT_MedQA.sh
└── src
    ├── GSM8K
    └── MedQA
```

- **`./data`**: the seed data distilled from DeepSeek-R1, the unlabeled training data, and the test data for the MedQA and GSM8K tasks.
- **`./src`**: the training code using HSIR for the MedQA and GSM8K tasks.
- **`./script`**: the scripts for SFT and DPO training on the MedQA and GSM8K tasks.
- **`./H-GRPO`**: the implementation and training scripts of H-GRPO.

More specifically, `./data/MedQA` contains four data files:
- **`./data/MedQA/medqa_train_ds_seed.json`**: the seed data distilled from DeepSeek-R1.
- **`./data/MedQA/medqa_train_ds_unlabeled.json`**: the unlabeled dataset without any reasoning paths.
- **`./data/MedQA/medqa_train_ds_distilled.json`**: all reasoning data distilled from DeepSeek-R1, i.e., the combination of seed data and unlabeled data (with the reasoning paths). This is used to train the SFT-Oracle models in our paper.
- **`./data/MedQA/medqa_grpo_unlabeled.json`**: the unlabeled data used to perform the GRPO training.
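
To get a quick overview of any of these files before training, a small inspection helper can be useful. The sketch below is an illustration only: `summarize_json` is a hypothetical helper, and it assumes that each file is a single JSON document, typically a list of records. It makes no assumption about the field names and simply reports whatever keys the first record has.

```python
import json
from pathlib import Path


def summarize_json(path) -> dict:
    """Return the top-level type, record count, and field names of a JSON file."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list):
        # Not a list of records: report the type only.
        return {"type": type(data).__name__, "num_records": None, "fields": []}
    first = data[0] if data else {}
    fields = sorted(first) if isinstance(first, dict) else []
    return {"type": "list", "num_records": len(data), "fields": fields}


if __name__ == "__main__":
    seed_file = Path("./data/MedQA/medqa_train_ds_seed.json")
    if seed_file.exists():
        print(summarize_json(seed_file))
    else:
        print(f"{seed_file} not found; run this from the repository root")
```

Comparing the record counts of the seed, unlabeled, and distilled files is a simple way to check that the distilled file is indeed the combination of the other two.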

## Getting Started
Here, we describe how to run self-improvement post-training with our **HSIR** method on 8 GPUs. Note that the pretrained models should be placed in `/workspace/huggingface`.

### HSIR for SFT training

We provide the SFT training scripts of HSIR in `./script`. You can train your models directly with the following commands:
``` bash
# For the MedQA task
model='Qwen2.5-1.5B-Instruct'  # model name
save_path='/workspace/output/medqa_sft/Qwen2.5-1.5B-Instruct'  # where the trained model is saved
bash /workspace/HSIR/script/HSIR-SFT_MedQA.sh "$model" "$save_path"

# For the GSM8K task
model='Qwen2.5-1.5B-Instruct'  # model name
save_path='/workspace/output/gsm8k_sft/Qwen2.5-1.5B-Instruct'  # where the trained model is saved
bash /workspace/HSIR/script/HSIR-SFT_GSM8K.sh "$model" "$save_path"
```

### HSIR for DPO training

Similarly, you can run DPO training directly with the following commands:
``` bash
# For the MedQA task
model='Qwen2.5-1.5B-Instruct'  # model name
save_path='/workspace/output/medqa_dpo/Qwen2.5-1.5B-Instruct'  # where the trained model is saved
bash /workspace/HSIR/script/HSIR-DPO_MedQA.sh "$model" "$save_path"

# For the GSM8K task
model='Qwen2.5-1.5B-Instruct'  # model name
save_path='/workspace/output/gsm8k_dpo/Qwen2.5-1.5B-Instruct'  # where the trained model is saved
bash /workspace/HSIR/script/HSIR-DPO_GSM8K.sh "$model" "$save_path"
```
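
The SFT and DPO scripts take the same two positional arguments, so the four launch commands can be generated programmatically. The sketch below is a hypothetical convenience wrapper, not part of this repository. It assumes the repository lives at `/workspace/HSIR` and reuses the output layout from the example paths above (`/workspace/output/<task>_<method>/<model>`). The names `build_command` and `run_all` are illustrative, and `run_all` only prints the commands unless `dry_run=False` is passed.

```python
import subprocess
from pathlib import PurePosixPath

REPO = PurePosixPath("/workspace/HSIR")      # repository location used in this README
OUTPUT = PurePosixPath("/workspace/output")  # output root taken from the example paths


def build_command(method: str, task: str, model: str) -> list:
    """Build the argument list for one HSIR training script."""
    if method not in {"SFT", "DPO"}:
        raise ValueError(f"unknown method: {method!r}")
    if task not in {"MedQA", "GSM8K"}:
        raise ValueError(f"unknown task: {task!r}")
    script = REPO / "script" / f"HSIR-{method}_{task}.sh"
    save_path = OUTPUT / f"{task.lower()}_{method.lower()}" / model
    return ["bash", str(script), model, str(save_path)]


def run_all(model: str, dry_run: bool = True) -> list:
    """Print (and optionally run) every method/task combination in sequence."""
    commands = [
        build_command(method, task, model)
        for method in ("SFT", "DPO")
        for task in ("MedQA", "GSM8K")
    ]
    for command in commands:
        print(" ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)  # stop at the first failing script
    return commands


if __name__ == "__main__":
    run_all("Qwen2.5-1.5B-Instruct")
```

The dry-run default is a deliberate choice: each script launches a multi-GPU training job, so printing the commands first makes it easy to verify the paths.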

### RL training with H-GRPO

To perform this process, first prepare the training environment as follows:

``` bash
# replace the grpo_trainer.py in the trl package
rm -r trl/trainer/grpo_trainer.py
mv /workspace/HSIR/H-GRPO/trl/grpo_trainer.py trl/trainer/grpo_trainer.py

# replace the training code in the open-r1 package
rm -r open-r1/src/open_r1
mv /workspace/HSIR/H-GRPO/open_r1 open-r1/src/
```
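
The relative paths above depend on where `trl` is located on your machine. If `trl` was installed with `pip`, the sketch below shows one way to find its directory and to swap in the patched trainer while keeping a backup of the original file. This is an assumption-laden illustration, not the repository's procedure: `package_dir` and `replace_with_backup` are hypothetical helpers, and the sketch copies the file instead of moving it so that the copy under `/workspace/HSIR` stays in place.

```python
import importlib.util
import shutil
from pathlib import Path


def package_dir(name: str):
    """Return the directory of an installed package, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))


def replace_with_backup(src, dst) -> Path:
    """Copy src over dst, first saving the existing dst as '<name>.bak'."""
    src, dst = Path(src), Path(dst)
    backup = dst.with_name(dst.name + ".bak")
    if dst.exists() and not backup.exists():
        shutil.copy2(dst, backup)  # keep the pristine original only once
    shutil.copy2(src, dst)
    return backup


if __name__ == "__main__":
    trl_dir = package_dir("trl")
    patched = Path("/workspace/HSIR/H-GRPO/trl/grpo_trainer.py")
    if trl_dir is None or not patched.exists():
        print("trl or the patched trainer was not found; nothing to do")
    else:
        # Printing only; call replace_with_backup(patched, target) to apply.
        target = trl_dir / "trainer" / "grpo_trainer.py"
        print("would replace:", target)
```

Restoring the original trainer is then a matter of copying `grpo_trainer.py.bak` back over `grpo_trainer.py`.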

We provide the training scripts of H-GRPO in `./H-GRPO/scripts`. You can start RL training with H-GRPO using the following commands:

``` bash
# For the MedQA task
model='/workspace/output/medqa_sft/Qwen2.5-1.5B-Instruct/medqa_train_ds_seed'  # path of the initial SFT model
save_path='/workspace/output/grpo_medqa/Qwen2.5-1.5B-Instruct'  # where the trained model is saved
bash /workspace/HSIR/H-GRPO/scripts/H-GRPO_MedQA.sh "$model" "$save_path"

# For the GSM8K task
model='/workspace/output/gsm8k_sft/Qwen2.5-1.5B-Instruct/gsm8k_train_ds_seed'  # path of the initial SFT model
save_path='/workspace/output/grpo_gsm8k/Qwen2.5-1.5B-Instruct'  # where the trained model is saved
bash /workspace/HSIR/H-GRPO/scripts/H-GRPO_GSM8K.sh "$model" "$save_path"
```

## Pipeline of HSIR
Taking SFT training on the MedQA task as an example, this part describes the detailed pipeline of HSIR:
``` bash
# Step 1: obtain the self-generated solutions.
bash /workspace/HSIR/src/MedQA/step1_rejection_sampling.sh

# Step 2: obtain the verify-then-exit sampling solutions.
bash /workspace/HSIR/src/MedQA/step2_verify-then-exit_sampling.sh

# Step 3: verify the correctness of the solutions and split them into a right group and a wrong group.
python /workspace/HSIR/src/MedQA/step3_splict_right_wrong.py

# Step 4: calculate the InDiv score for the correct solutions.
bash /workspace/HSIR/src/MedQA/step4_cal_indiv_score.sh

# Step 5: filter the data with lower InDiv scores.
python /workspace/HSIR/src/MedQA/step5_filter_data.py

# Step 6: prepare the SFT dataset.
python /workspace/HSIR/src/MedQA/step6_sft_data_prepare.py

# Step 7: fine-tune the base model on the seed data and the above self-generated SFT dataset.

# Step 8: evaluate the tuned model on the in-distribution dataset.
bash /workspace/HSIR/src/MedQA/step8_model_inference.sh
``` 
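
To make the data flow of Steps 3 and 5 concrete, the following sketch shows the general shape of a split-then-filter stage. It is an illustration under stated assumptions, not the code in `./src/MedQA`: the field names `prediction`, `answer`, and `indiv_score` are hypothetical, correctness is approximated by a case-insensitive exact match, the threshold is arbitrary, and the sketch assumes that solutions with higher InDiv scores are the ones to keep. How the InDiv score itself is computed is defined by the Step 4 script and is not reproduced here.

```python
def split_right_wrong(solutions):
    """Split solutions into (right, wrong) by comparing prediction and answer."""
    right, wrong = [], []
    for solution in solutions:
        # Hypothetical correctness check: exact match after trimming and upper-casing.
        predicted = solution["prediction"].strip().upper()
        expected = solution["answer"].strip().upper()
        (right if predicted == expected else wrong).append(solution)
    return right, wrong


def filter_by_score(solutions, threshold, key="indiv_score"):
    """Keep the solutions whose score is at least the threshold (assumed direction)."""
    return [solution for solution in solutions if solution[key] >= threshold]


if __name__ == "__main__":
    # Toy records with made-up values, for illustration only.
    samples = [
        {"prediction": "B", "answer": "B", "indiv_score": 0.9},
        {"prediction": "C", "answer": "B", "indiv_score": 0.8},
        {"prediction": "b", "answer": "B", "indiv_score": 0.2},
    ]
    right, wrong = split_right_wrong(samples)
    kept = filter_by_score(right, threshold=0.5)
    print(len(right), "right,", len(wrong), "wrong,", len(kept), "kept after filtering")
```

Only the right group is scored and filtered here, which mirrors Step 4 computing the InDiv score for correct solutions.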