# SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

*This is an anonymous submission for conference review.*

## Introduction
Recent advances in reinforcement learning have shown that language models can develop sophisticated reasoning through training on tasks with verifiable rewards, but these approaches depend on expert-curated problem-answer pairs and domain-specific reward engineering.

We introduce SPIRAL, a self-play framework in which models learn by playing **multi-turn, zero-sum games against continuously improving versions of themselves**, eliminating the need for human supervision. Through zero-sum self-play, SPIRAL generates an **_infinite curriculum_** of progressively challenging problems, because models must constantly adapt to stronger opponents.

Applying SPIRAL to Qwen3 base models in two-player zero-sum text games, we observe that the agents develop advanced reasoning strategies in order to win. Furthermore, the trained models show substantial gains on a range of math and general reasoning benchmarks.
These results suggest that self-play in zero-sum games can naturally induce transferable reasoning capabilities, highlighting a promising direction for autonomous reasoning development.

<p align="center"><img src="assets/fig1-1.png" width="100%" /></p>

## Architecture

<p align="center"><img src="assets/framework.png" width="90%" /></p>

SPIRAL employs an actor-learner architecture for scalable self-play training. Parallel actors sample trajectories from a diverse set of games using vectorized environments. A single policy $\pi_t$ plays both roles, generating trajectories from zero-sum games with sparse rewards. The centralized learner processes these trajectories using Role-conditioned Advantage Estimation (RAE) to compute a separate advantage for each role, $A_0(s,a)$ and $A_1(s,a)$. These advantages are then used for on-policy reinforcement learning updates.
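To make the role-conditioned idea concrete, here is a minimal sketch of one way separate per-role advantages could be computed. It assumes each role keeps its own exponential-moving-average baseline of final returns, keyed by game and role, and that the advantage is the return minus that baseline. The names `Trajectory` and `RoleBaselines`, the `decay` parameter, and the choice to update the baseline before subtracting are all illustrative assumptions, not the repository's actual implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trajectory:
    """Hypothetical container for one finished self-play game."""
    game: str
    returns: tuple[float, float]  # final reward for role 0 and role 1 (zero-sum)


class RoleBaselines:
    """Keeps one moving-average baseline per (game, role) pair (assumed design)."""

    def __init__(self, decay: float = 0.95):
        self.decay = decay
        self.values = defaultdict(float)  # baselines start at 0.0

    def advantages(self, traj: Trajectory) -> tuple[float, ...]:
        """Update each role's baseline, then return (A_0, A_1) for this game."""
        advs = []
        for role, ret in enumerate(traj.returns):
            key = (traj.game, role)
            # Exponential moving average of this role's returns.
            baseline = self.decay * self.values[key] + (1 - self.decay) * ret
            self.values[key] = baseline
            # Advantage is measured against the role's own baseline only.
            advs.append(ret - baseline)
        return tuple(advs)


baselines = RoleBaselines(decay=0.5)
a0, a1 = baselines.advantages(Trajectory("KuhnPoker", (1.0, -1.0)))
```

Because the two roles share one policy but can have different expected returns (for example, if moving first is an advantage), keeping the baselines separate avoids crediting or blaming actions for an asymmetry that comes from the role itself.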

## Usage

### Installation
```bash
# prepare environment
conda create -y -n spiral python=3.10
conda activate spiral

# install dependencies
pip install vllm==0.8.4 && pip install oat-llm==0.2.1
pip install -e .
```

### Training

```bash
bash run.sh
```
This training script runs SPIRAL on the Kuhn Poker environment for 400 policy iteration steps. It has been tested on an 8×H100 GPU setup. During training, we evaluate model performance online using three metrics:

1) Win rate against a fixed opponent on the training game;

2) Win rate against a fixed opponent on an out-of-domain game;

3) Accuracy on math reasoning benchmarks.

Example result curves are shown below.

<p align="center"><img src="assets/curves.png" width="100%" /></p>
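For reference, the win-rate metrics listed above can be computed from per-game outcomes along the following lines. This is only a sketch: it assumes outcomes are encoded as `+1` (win), `0` (draw), and `-1` (loss) from the trained model's perspective, and the helper name `win_rate` is hypothetical rather than taken from the training code.

```python
def win_rate(outcomes: list[int]) -> float:
    """Fraction of games won; assumes +1 = win, 0 = draw, -1 = loss."""
    if not outcomes:
        raise ValueError("need at least one game outcome")
    wins = sum(1 for result in outcomes if result > 0)
    return wins / len(outcomes)


# Two wins, one loss, and one draw against the fixed opponent.
rate = win_rate([1, -1, 0, 1])
```

Under this encoding draws count against the win rate; a different convention (for example, half credit for a draw) would shift the curves accordingly.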

### Evaluation

In addition to the online evaluation, we provide offline evaluation across a broader range of benchmarks to assess the model's out-of-domain (OOD) game performance and general reasoning capabilities.

**Game evaluation**
```bash
# We rely on OpenRouter to play against other models
export OPENROUTER_API_KEY=""

# Add your models to evals/game/batch_run.sh
bash evals/game/batch_run.sh
```

**Benchmark evaluation**
```bash
cd evals/benchmarks
# Add your models to batch_run.sh
bash batch_run.sh
```

## Acknowledgement
* The language games are sampled from [TextArena](https://github.com/LeonGuertler/TextArena), a collection of competitive text-based games for language model evaluation and reinforcement learning.
* The multi-agent, multi-turn RL training is implemented with 🌾 [Oat](https://github.com/sail-sg/oat), a modular and research-friendly LLM RL framework.
* The base models are from [Qwen3](https://huggingface.co/Qwen/Qwen3-4B).
