# SPELL: Self-Play Reinforcement Learning for Evolving Long-Context Language Models

This repository provides the implementation of **SPELL: Self-Play Reinforcement Learning for Evolving Long-Context Language Models**. SPELL uses self-play reinforcement learning to improve the long-context understanding and reasoning of large language models.

## 🛠️ Requirements

To set up the project environment, please follow the steps below. These commands will create a dedicated Conda environment and install all necessary dependencies.

```bash
# Create the conda environment
conda create -n spell python==3.10.16
conda activate spell

# Install verl
cd verl
pip install -e .

# Install vLLM
pip install vllm==0.8.5.post1

# Install flash-attn
pip install flash-attn --no-build-isolation

cd ..
pip install -r requirements.txt
```
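To sanity-check the environment after installation, a minimal sketch like the one below can report which of the key packages are visible to Python. The distribution names in `PACKAGES` are assumptions about how each dependency is registered with pip; adjust them if your environment differs.

```python
from importlib import metadata
from typing import Optional

# Distribution names assumed to match the install commands above.
PACKAGES = ["verl", "vllm", "flash-attn"]


def installed_version(name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def report(packages=PACKAGES) -> dict:
    """Map each package name to its installed version (None when not installed)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it inside the activated `spell` environment; any `NOT INSTALLED` entry points at the install step to repeat.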

## 🚀 Quick Start

This section provides a comprehensive guide to using the SPELL model, from dataset preparation to training and evaluation.

### 🗂️ Dataset

The training process uses the following datasets. Processed versions are provided in the `dataset/` directory:

* **`dataset/docmath_qa/train.parquet`**: A training dataset for mathematical reasoning over financial documents.
* **`dataset/ultra_fineweb/train.parquet`**: A large-scale, high-quality web dataset used for pre-training language models.

For more details, please refer to the Appendix of our paper.

For evaluation, you will need to download the following benchmarks from HuggingFace and ensure the paths are correctly configured in `eval/generate.py`:

* **`yale-nlp/DocMath-Eval`**
* **`THUDM/LongBench`**
* **`THUDM/LongBench-v2`**
* **`Tongyi-Zhiwen/frames`**
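One possible way to fetch the four benchmarks is sketched below with `huggingface_hub.snapshot_download`. Treating each entry as a Hugging Face dataset repository and storing it under `eval/data/<name>` are both assumptions, and `local_dir_for` and `download_all` are hypothetical helpers. Whatever layout you choose, the paths in `eval/generate.py` must point at it.

```python
from pathlib import Path

# Benchmark repositories listed above; treating them as *dataset* repos is an assumption.
BENCHMARKS = [
    "yale-nlp/DocMath-Eval",
    "THUDM/LongBench",
    "THUDM/LongBench-v2",
    "Tongyi-Zhiwen/frames",
]


def local_dir_for(repo_id: str, root: str = "eval/data") -> Path:
    """Hypothetical layout: one sub-directory per benchmark, named after the repo."""
    return Path(root) / repo_id.split("/")[-1]


def download_all(root: str = "eval/data") -> None:
    """Download every benchmark snapshot (requires network access)."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    for repo_id in BENCHMARKS:
        snapshot_download(
            repo_id=repo_id,
            repo_type="dataset",
            local_dir=str(local_dir_for(repo_id, root)),
        )
```

Calling `download_all()` from the repository root would place, for example, LongBench-v2 under `eval/data/LongBench-v2`.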

### 💻 Training

To begin training, start a multi-node Ray cluster using the provided script:

```bash
scripts/start_ray.sh
```

The training scripts in the `scripts/` directory are optimized for a distributed setup consisting of nodes equipped with 8 x 80G NVIDIA A100 GPUs.

After training, convert the FSDP checkpoint to the Hugging Face model format:

```bash
verl/scripts/model_merge.sh
```

#### Implementation Details

Our codebase is built upon **VeRL**. The key modifications and additions for SPELL are located in the following files:

**Trainer:**

* `verl/verl/trainer/main_ray_spell.py` (new)
* `verl/verl/trainer/ppo/ray_spell_trainer.py` (new, core SPELL training loop)
* `verl/verl/trainer/config/spell_trainer.yaml` (new, configuration file)
* `verl/verl/trainer/ppo/core_algos.py` (lines 199-373, role-specific reward & advantage estimation)
* `verl/verl/trainer/ppo/metric_utils.py` (lines 80-447, role-specific metrics)

**Dataset:**

* `verl/verl/utils/dataset/spell_dataset.py` (new, data collector & history memory)
* `verl/verl/utils/dataset/prompts.py` (new, prompt templates)

**Workers:**

* `verl/verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py` (lines 257 and 275-279, specify the number of rollouts)
* `verl/verl/workers/actor/dp_actor.py` (lines 370-371, ensure a single update per training batch)


### 📊 Evaluation

We conduct a thorough evaluation on six long-context Document Question Answering (DocQA) benchmarks. These include multiple-choice (LongBench-v2), multi-hop reasoning (2WikiMultihopQA, HotpotQA, MuSiQue, Frames), and financial report reasoning (DocMath) tasks. The final score is reported as the maximum of the cover exact match and an LLM-judged accuracy score provided by `gpt-oss-120b`.
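The scoring rule can be illustrated with a short sketch. It assumes that "cover exact match" means the normalized gold answer appears as a substring of the normalized prediction, and that the judge verdict is a boolean. The normalization used here is a common SQuAD-style choice and not necessarily what `eval/verify.py` does, and all function names are hypothetical.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def cover_exact_match(prediction: str, gold_answers: list) -> float:
    """1.0 if any non-empty normalized gold answer is contained in the prediction."""
    pred = normalize(prediction)
    golds = [normalize(g) for g in gold_answers]
    return float(any(g and g in pred for g in golds))


def final_score(prediction: str, gold_answers: list, judge_correct: bool) -> float:
    """Maximum of cover exact match and the LLM-judged accuracy for one sample."""
    return max(cover_exact_match(prediction, gold_answers), float(judge_correct))
```

Under these assumptions a sample counts as correct if either signal accepts it, so a paraphrased answer that fails string matching can still be credited by the judge.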

The evaluation process is detailed in `eval/scripts/eval_single.sh` and consists of two main steps, preceded by a dataset preparation step.

#### Step 0: Prepare Datasets

Before beginning, ensure you have downloaded the necessary evaluation datasets from HuggingFace and have correctly updated the file paths within the `eval/generate.py` script.

#### Step 1: Generate Model Outputs

This step generates the model's responses for the evaluation benchmarks. The script is configured to test the model on two different input length settings (16K and 100K tokens).

```bash
#!/bin/bash
export CUDA_VISIBLE_DEVICES=0,1,2,3
export VLLM_WORKER_MULTIPROC_METHOD='spawn'

PROJ_DIR="eval"

TP=0.7
TOP_P=0.95
TOP_K=-1

TASKS_LIST=("docmath" "frames" "2wikimqa" "hotpotqa" "musique" "longbench-v2") 

N_SAMPLES=8

MAX_OUTPUT_LEN=20000

MODEL_NAME=Qwen3-30B-A3B-Thinking-2507
MODEL_PATH="<your_model_path>/${MODEL_NAME}"

mkdir -p ${PROJ_DIR}/results

# test in two different input length settings
for MAX_INPUT_LEN in 16384 100000
do

    SAVE_NAME="${MODEL_NAME}_I${MAX_INPUT_LEN}_O${MAX_OUTPUT_LEN}_N${N_SAMPLES}"

    # Step1: generate model outputs
    python ${PROJ_DIR}/generate.py \
        --input_dir "${PROJ_DIR}/data" \
        --save_dir "${PROJ_DIR}/results" \
        --save_file ${SAVE_NAME} \
        --model "${MODEL_PATH}" \
        --tokenizer "${MODEL_PATH}" \
        --tasks "${TASKS_LIST[@]}" \
        --n_sampling ${N_SAMPLES} \
        --temperature ${TP} \
        --top_p ${TOP_P} \
        --max_input_len ${MAX_INPUT_LEN} \
        --max_output_len ${MAX_OUTPUT_LEN} \
        --gpu_memory_utilization 0.9 \
        --top_k ${TOP_K} \
        --split ${N_SAMPLES} 

done

```
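Step 2 passes the same `--save_dir` and `--save_file` values, so the name built here presumably has to match exactly in both scripts. The sketch below mirrors the `SAVE_NAME` pattern; the helper `save_name` is hypothetical and only illustrates the naming scheme.

```python
def save_name(model_name: str, max_input_len: int, max_output_len: int, n_samples: int) -> str:
    """Mirror the SAVE_NAME pattern used in the shell script above."""
    return f"{model_name}_I{max_input_len}_O{max_output_len}_N{n_samples}"


# The two runs produced by the loop above (16384 and 100000 input tokens).
names = [save_name("Qwen3-30B-A3B-Thinking-2507", n, 20000, 8) for n in (16384, 100000)]
```

If you change `MAX_OUTPUT_LEN` or `N_SAMPLES` between generation and verification, the names differ and the two steps no longer refer to the same run.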

#### Step 2: Generate Evaluation Results

To obtain the LLM-judged accuracy, you first need to serve the `gpt-oss-120b` model as an OpenAI API endpoint.

**Serve the Judge Model (`gpt-oss-120b`)**

Use the following command to serve the judge model on your local machine using vLLM.

```bash
export VLLM_USE_TRITON_FLASH_ATTN=1  # Use Triton Flash Attention in vLLM
export VLLM_FLASH_ATTN_VERSION=3     # Force a specific flash-attention version (2 or 3); only valid with the flash-attention backend

vllm serve openai/gpt-oss-120b \
    --served-model-name gpt-oss-120b \
    --host 0.0.0.0 \
    --async-scheduling \
    --tensor-parallel-size 2 \
    --max-model-len 16384 \
    --port 23547

```
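Before launching verification, it can help to confirm that the endpoint is reachable. The sketch below queries the `/v1/models` route of vLLM's OpenAI-compatible server using only the standard library. The default host `localhost` is an assumption, the port matches the command above, and both helper names are hypothetical.

```python
import json
from urllib import request


def models_url(host: str = "localhost", port: int = 23547) -> str:
    """URL of the OpenAI-compatible model listing."""
    return f"http://{host}:{port}/v1/models"


def served_models(host: str = "localhost", port: int = 23547, timeout: float = 10.0) -> list:
    """Return the model ids reported by the endpoint.

    Raises urllib.error.URLError if the server is not reachable.
    """
    with request.urlopen(models_url(host, port), timeout=timeout) as resp:
        payload = json.load(resp)
    return [entry["id"] for entry in payload.get("data", [])]
```

Once the server has finished loading, `"gpt-oss-120b" in served_models()` should hold, since that is the name passed via `--served-model-name`.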

**Run the Evaluation Script**

Once the judge model is running, execute the script below to verify the generated outputs and calculate the final scores. Remember to replace `<your_api_host>` and `<your_api_port>` with the actual host and port of your judge model endpoint.

```bash
#!/bin/bash
PROJ_DIR="eval"

TP=0.7
TOP_P=0.95
TOP_K=-1

TASKS_LIST=("docmath" "frames" "2wikimqa" "hotpotqa" "musique" "longbench-v2") 

JUDGE_MODEL="gpt-oss-120b"
VERIFIER_HOST="<your_api_host>"
VERIFIER_PORT="<your_api_port>"
API_BASE="http://${VERIFIER_HOST}:${VERIFIER_PORT}/v1"

N_SAMPLES=8

MAX_OUTPUT_LEN=20000

MODEL_NAME=Qwen3-30B-A3B-Thinking-2507
MODEL_PATH="<your_model_path>/${MODEL_NAME}"

mkdir -p ${PROJ_DIR}/results

# test in two different input length settings
for MAX_INPUT_LEN in 16384 100000
do

    SAVE_NAME="${MODEL_NAME}_I${MAX_INPUT_LEN}_O${MAX_OUTPUT_LEN}_N${N_SAMPLES}"

    # generate llm-as-judge score
    python ${PROJ_DIR}/verify.py \
        --save_dir "${PROJ_DIR}/results" \
        --save_file ${SAVE_NAME} \
        --model "${JUDGE_MODEL}" \
        --tasks "${TASKS_LIST[@]}" \
        --temperature 0.0 \
        --n_proc 200 \
        --top_p 1.0 \
        --max_input_len 8192 \
        --max_output_len 8192 \
        --top_k -1 \
        --api_key "EMPTY" \
        --api_base ${API_BASE} 

done

```
