
# RLPR: Extrapolating RLVR To General Domains


## 📌Contents <!-- omit in toc -->

- [RLPR: Extrapolating RLVR To General Domains](#rlpr-extrapolating-rlvr-to-general-domains)
  - [Dataset](#dataset)
  - [Install](#install)
  - [Train](#train)
    - [Resuming Training](#resuming-training)
  - [Evaluation](#evaluation)
    - [Convert checkpoints to HuggingFace format model](#convert-checkpoints-to-huggingface-format-model)

## Dataset

Training and evaluation datasets are supplied with the supplementary materials.

## Install

```bash
bash scripts/setup_env.sh
```

## Train

1. Prepare data

Download the training and test datasets. Move `rlpr_train.parquet` to `./datasets/train`, and move all test datasets to `./datasets/test`.
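
The move described above can be sketched as a small helper. This is only an illustration: the function name `prepare_datasets`, the `DOWNLOAD_DIR` variable, and the assumption that the test sets are also `.parquet` files are not part of the repository.

```shell
# Hypothetical helper: sort downloaded files into ./datasets/train and ./datasets/test.
# Argument: directory that holds the downloaded files.
prepare_datasets() {
    local src=$1
    mkdir -p ./datasets/train ./datasets/test

    # The training set has a fixed name (from the step above).
    if [ -f "$src/rlpr_train.parquet" ]; then
        mv "$src/rlpr_train.parquet" ./datasets/train/
    fi

    # Assumption: every remaining .parquet file is a test set.
    for f in "$src"/*.parquet; do
        [ -e "$f" ] || continue   # glob matched nothing
        mv "$f" ./datasets/test/
    done
}

# DOWNLOAD_DIR is a placeholder; point it at wherever you saved the files.
prepare_datasets "${DOWNLOAD_DIR:-./downloads}"
```

The training file is moved first so that the loop only picks up the remaining files.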


2. Specify the base model path in `examples/RLPR/reproduce_<model>.sh`, where `<model>` is `qwen`, `llama`, or `gemma`.
```bash
MODEL=path_to_base_model
```

3. (Optional) To log with wandb, log in to wandb and set `USE_WANDB` to True in `examples/RLPR/reproduce_<model>.sh`.

```bash
USE_WANDB=${USE_WANDB:-"false"}
```

4. (Optional) To use the LLM-as-a-judge evaluation method, follow the steps below. Skip this step to judge answers with the rule-based verifier.
    - Open-source model as judge
        1. Create a new environment for the server and deploy the model. Specify the judge model, host, and port in `scripts/setup_server.sh`.

            ```shell
            bash scripts/setup_server.sh
            ```

        2. Specify the server address and the judge model in `examples/RLPR/reproduce_<model>.sh`.

            ```shell
            export CLIENT_IP=http://127.0.0.1:8001
            export USED_MODEL=Qwen/Qwen2.5-72B-Instruct
            ```
    - API-based model (gpt-4.1-mini) as judge

        Specify the API token and the judge model in `examples/RLPR/reproduce_<model>.sh` to use the OpenAI API.

        ```shell
        export OPENAI_API_KEY=your_api_token
        export OPENAI_API_BASE=your_api_base  # default is https://api.openai.com/v1
        export USED_MODEL=gpt-4.1-mini
        ```
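
Before launching a run, it may help to confirm that one of the two judge configurations is complete. The function below is a hypothetical pre-flight check, not part of the repository. It only inspects the variables named in this step and makes no claim about how the training script itself chooses a judge.

```shell
# Hypothetical pre-flight check for the judge settings.
# Returns 0 and prints the detected setup, or returns 1 with a message on stderr.
check_judge_env() {
    if [ -z "${USED_MODEL:-}" ]; then
        echo "USED_MODEL is not set" >&2
        return 1
    fi
    if [ -n "${CLIENT_IP:-}" ]; then
        echo "judge: local server at $CLIENT_IP (model $USED_MODEL)"
    elif [ -n "${OPENAI_API_KEY:-}" ]; then
        echo "judge: OpenAI API (model $USED_MODEL)"
    else
        echo "neither CLIENT_IP nor OPENAI_API_KEY is set" >&2
        return 1
    fi
}

# Example use, e.g. after the export lines in the reproduce script:
# check_judge_env || exit 1
```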

5. Run the training script

```shell
bash examples/RLPR/reproduce_qwen.sh
# bash examples/RLPR/reproduce_llama.sh
# bash examples/RLPR/reproduce_gemma.sh
```


### Resuming Training

To resume training from a specific step, go to the checkpoint save directory (default: `data/checkpoints/`), change the value in `latest_checkpointed_iteration.txt` to the target step, and rerun the training script.
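
The manual edit can also be scripted. The sketch below is a hypothetical helper (`set_resume_step` is not provided by the repository). It assumes that `latest_checkpointed_iteration.txt` sits in the same directory as the `global_step_<step>` folders, and it refuses to write a step for which no such folder exists.

```shell
# Hypothetical helper: point the checkpoint tracker at a given step.
# Arguments: <checkpoint_dir> <step>
set_resume_step() {
    local ckpt_dir=$1 step=$2
    local tracker="$ckpt_dir/latest_checkpointed_iteration.txt"

    # Assumption: checkpoints are stored as global_step_<step> next to the tracker file.
    if [ ! -d "$ckpt_dir/global_step_$step" ]; then
        echo "no checkpoint for step $step in $ckpt_dir" >&2
        return 1
    fi
    echo "$step" > "$tracker"
}

# Example use (the directory and the step value 100 are placeholders):
# set_resume_step data/checkpoints/<exp_name> 100
# bash examples/RLPR/reproduce_qwen.sh
```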

## Evaluation

1. Follow steps 1-4 in the [Train](#train) section to prepare the data, the base model, and (optionally) the judge model.

2. Run the evaluation script

```shell
bash examples/RLPR/reproduce_qwen.sh +trainer.val_only=True
# bash examples/RLPR/reproduce_llama.sh +trainer.val_only=True
# bash examples/RLPR/reproduce_gemma.sh +trainer.val_only=True
```

### Convert checkpoints to HuggingFace format model

Run the following command, replacing the placeholders with your checkpoint and output paths:
```shell
python scripts/model_merger.py --local_dir <checkpoint_folder>/<exp_name>/global_step_<step>/actor --target_dir <target_dir>
```
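
As a sketch of how the placeholders might be filled in, the snippet below assembles the actor path from variables and only calls the merger if that directory exists. All four default values (`data/checkpoints`, `my_experiment`, `100`, `./hf_model`) are illustrative assumptions; substitute your own.

```shell
# Placeholder values; override them via the environment or edit them here.
CKPT_ROOT=${CKPT_ROOT:-data/checkpoints}
EXP_NAME=${EXP_NAME:-my_experiment}
STEP=${STEP:-100}
TARGET_DIR=${TARGET_DIR:-./hf_model}

# Path layout taken from the command above.
ACTOR_DIR="$CKPT_ROOT/$EXP_NAME/global_step_$STEP/actor"

if [ -d "$ACTOR_DIR" ]; then
    python scripts/model_merger.py --local_dir "$ACTOR_DIR" --target_dir "$TARGET_DIR"
else
    echo "actor checkpoint not found: $ACTOR_DIR" >&2
fi
```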


## Licenses <!-- omit in toc -->


[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/LICENSE)
[![Data License](https://img.shields.io/badge/Data%20License-CC%20By%20NC%204.0-red.svg)](https://github.com/tatsu-lab/stanford_alpaca/blob/main/DATA_LICENSE)