# R-Diverse: Self-Evolution Training System

R-Diverse is a reinforcement-learning-based self-evolution training framework. It improves mathematical reasoning by alternately training two roles: a **Questioner** (problem generator) and a **Solver** (problem solver).

## 📋 Table of Contents

- [System Architecture](#system-architecture)
- [Training Pipeline](#training-pipeline)
- [Core Scripts](#core-scripts)
- [Quick Start](#quick-start)
- [Configuration](#configuration)
- [Directory Structure](#directory-structure)
- [License](#license)

## System Architecture

```
R-Diverse
├── scripts/                    # Training scripts
│   ├── main.sh                 # Main entry (self-evolution loop)
│   ├── questioner_train_penalty.sh  # Questioner training
│   └── solver_train.sh         # Solver training
├── verl/trainer/               # Trainer core
│   ├── main.py                 # Training entry point
│   └── ray_trainer.py          # Ray-based PPO/GRPO trainer
├── vllm_service_init/          # vLLM inference services
│   ├── start_vllm_server.py    # Solver vLLM service
│   └── start_vllm_server_code.py  # Code generation vLLM service
├── examples/reward_function/   # Reward functions
│   └── caller_penalty.py       # Questioner reward (with diversity penalty)
├── memory_bank/                # Memory Bank management
│   └── update_memory.py        # Update historical question bank
└── question_generate/          # Question generation module
    └── question_generate.bash  # Batch question generation
```

## Training Pipeline

### Overall Pipeline

```
┌──────────────────────────────────────────────────────────────────┐
│                   Self-Evolution Loop (5 rounds)                 │
├──────────────────────────────────────────────────────────────────┤
│                                                                  │
│  ┌─────────────────┐      ┌─────────────────────────────────┐   │
│  │  Questioner v_i │ ───► │         Solver v_i              │   │
│  │    Training     │      │         Training                │   │
│  └─────────────────┘      └─────────────────────────────────┘   │
│          │                              │                        │
│          │                              │                        │
│          ▼                              ▼                        │
│  ┌─────────────────┐      ┌─────────────────────────────────┐   │
│  │ Questioner v_{i+1}│◄── │       Solver v_{i+1}            │   │
│  │    Training     │      │         Training                │   │
│  └─────────────────┘      └─────────────────────────────────┘   │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘
```
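
The alternation in the diagram can be written out as an ordered list of training phases. This is only a sketch and not the contents of `main.sh`: the helper `plan_rounds`, its field names, and the assumption that version 0 means the base model and that each Questioner is trained against the previous round's Solver are illustrative guesses.

```python
def plan_rounds(num_rounds=5):
    """Return the ordered training phases of the self-evolution loop.

    Version 0 stands for the base model (an assumption of this sketch).
    """
    phases = []
    for i in range(1, num_rounds + 1):
        # Assumed: Questioner v_i is trained first, scored by the previous Solver.
        phases.append({
            "train": "questioner",
            "version": i,
            "solver_input": i - 1,
            "questioner_input": i - 1,
        })
        # Assumed: Solver v_i is then trained on questions from Questioner v_i.
        phases.append({
            "train": "solver",
            "version": i,
            "solver_input": i - 1,
            "questioner_input": i,
        })
    return phases


if __name__ == "__main__":
    for phase in plan_rounds():
        print(phase)
```

With the default of 5 rounds this yields 10 phases, alternating Questioner and Solver training.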

### Detailed Training Steps

#### 1. Questioner Training (`questioner_train_penalty.sh`)

Train the Questioner to generate high-quality, diverse math problems.

```
Questioner Training Pipeline:
┌─────────────────────────────────────────────────────────────┐
│ 1. Start vLLM Services (Sleep Mode)                         │
│    └── Solver vLLM (ports 5000-5007)                        │
│    └── Code vLLM (ports 6000-6007, code mode)               │
│                                                             │
│ 2. GRPO Training (5 steps)                                  │
│    ├── Generate questions + answers                         │
│    ├── Call Solver to evaluate uncertainty                  │
│    ├── Compute Batch Penalty (current batch diversity)      │
│    ├── Compute Memory Penalty (historical diversity)        │
│    └── Update Questioner                                    │
│                                                             │
│ 3. Merge model weights                                      │
└─────────────────────────────────────────────────────────────┘
```

**Questioner Reward Formula:**
```
Reward = Uncertainty_Score - α × Batch_Penalty - β × Memory_Penalty
```

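- **Uncertainty_Score**: Derived from the Solver's confidence in answering the question; the score is highest when that confidence is close to 0.5
- **Batch_Penalty**: Similarity penalty against other questions in the current batch, weighted by α (`PENALTY_ALPHA`)
- **Memory_Penalty**: Similarity penalty against historical questions in the Memory Bank, weighted by β (`PENALTY_BETA`)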
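
As a rough illustration of the formula, the sketch below combines the three terms in Python. The function names are hypothetical, and the linear mapping from Solver confidence to `Uncertainty_Score` is an assumption: the README only states that values closer to 0.5 are better, so the actual mapping in `caller_penalty.py` may differ.

```python
def uncertainty_score(solver_confidence: float) -> float:
    """Assumed mapping: 1.0 at confidence 0.5, falling linearly to 0.0 at 0 or 1."""
    return 1.0 - 2.0 * abs(solver_confidence - 0.5)


def questioner_reward(
    solver_confidence: float,
    batch_penalty: float,
    memory_penalty: float,
    alpha: float = 1.0,  # PENALTY_ALPHA default
    beta: float = 1.0,   # PENALTY_BETA default
) -> float:
    """Reward = Uncertainty_Score - alpha * Batch_Penalty - beta * Memory_Penalty."""
    return (
        uncertainty_score(solver_confidence)
        - alpha * batch_penalty
        - beta * memory_penalty
    )


if __name__ == "__main__":
    # A question the Solver answers correctly about half the time,
    # with mild overlap against the batch and the Memory Bank.
    print(questioner_reward(0.5, batch_penalty=0.2, memory_penalty=0.3))
```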

#### 2. Solver Training (`solver_train.sh`)

Train the Solver to solve problems generated by the Questioner.

```
Solver Training Pipeline:
┌─────────────────────────────────────────────────────────────┐
│ Module 1: Question Generate                                 │
│    └── Generate 1000 questions using current Questioner     │
│                                                             │
│ Module 2: Evaluate                                          │
│    └── Evaluate question quality using current Solver       │
│                                                             │
│ Module 3: Update Memory Bank                                │
│    └── Add high-quality questions to Memory Bank            │
│    └── Compute question embeddings (NL or Code mode)        │
│                                                             │
│ Module 4: Upload                                            │
│    └── Upload dataset to HuggingFace                        │
│                                                             │
│ Module 5: Experience Replay (optional)                      │
│    └── Mix historical data to prevent forgetting            │
│                                                             │
│ Module 6: GRPO Training (15 steps)                          │
│    └── Train Solver to solve generated problems             │
│                                                             │
│ Module 7: Final Evaluation                                  │
│    └── Evaluate Solver performance on test set              │
└─────────────────────────────────────────────────────────────┘
```

## Core Scripts

### `scripts/main.sh`

Main entry script that executes 5 rounds of self-evolution training.

```bash
# Usage
bash scripts/main.sh <Base_model_path> <Model_abbr>

# Example
bash scripts/main.sh /path/to/Qwen2.5-7B-Instruct qwen7b_exp1

# Resume from checkpoint
RESUME_FROM=solver_v2_step10 bash scripts/main.sh /path/to/model qwen7b_exp1
```

**Main Configuration:**

| Environment Variable | Description | Default |
|---------------------|-------------|---------|
| `TOTAL_GPU_COUNT` | Total number of GPUs | 8 |
| `EMBEDDING_TYPE` | Embedding type (nl/code) | code |
| `PENALTY_ALPHA` | Batch Penalty weight | 1.0 |
| `PENALTY_BETA` | Memory Penalty weight | 1.0 |
| `REPLAY_STRATEGY` | Experience replay strategy (none/pre_eval/post_eval) | post_eval |
| `RESUME_FROM` | Checkpoint to resume from, e.g. `solver_v2_step10` | (empty) |

### `scripts/questioner_train_penalty.sh`

Questioner training script.

```bash
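# Usage
bash scripts/questioner_train_penalty.sh \
    <solver_model_path> <questioner_model_path> <save_path_name> <iteration> [resume_checkpoint_path]

# Positional parameters
#   $1: Solver model path
#   $2: Questioner model path
#   $3: Save path name
#   $4: Iteration number
#   $5: Resume checkpoint path (optional)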
```

### `scripts/solver_train.sh`

Solver training script with complete data preparation and training pipeline.

```bash
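# Usage
bash scripts/solver_train.sh \
    <solver_model_path> <questioner_model_path> <experiment_name> <iteration> [resume_checkpoint_path]

# Positional parameters
#   $1: Solver model path
#   $2: Questioner model path
#   $3: Experiment name
#   $4: Iteration number
#   $5: Resume checkpoint path (optional)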
```

### `verl/trainer/main.py`

Trainer entry point that initializes Ray distributed environment and configuration.

### `verl/trainer/ray_trainer.py`

Ray-based PPO/GRPO trainer implementation:
- Supports FSDP distributed training
- Implements KL penalty and advantage estimation (GAE/GRPO/RLOO)
- Supports time-sharing mode with vLLM for GPU sharing

### `examples/reward_function/caller_penalty.py`

Questioner's reward function with core features:
- Calls Solver vLLM to compute uncertainty scores
- Computes Batch Penalty (based on BLEU or Code Embedding)
- Computes Memory Penalty (similarity with Memory Bank)
- Supports vLLM Sleep/Wake mode to save GPU memory
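
The two diversity penalties can be pictured with a small embedding-based sketch. It is not the code in `caller_penalty.py`: the function names are hypothetical, the BLEU-based variant is not shown, and the choice of mean similarity within the batch and maximum similarity against the Memory Bank is an assumption made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def batch_penalty(embeddings, index):
    """Assumed form: mean similarity between one question and the rest of its batch."""
    others = [e for j, e in enumerate(embeddings) if j != index]
    if not others:
        return 0.0
    return sum(cosine(embeddings[index], e) for e in others) / len(others)


def memory_penalty(embedding, memory_bank):
    """Assumed form: highest similarity to any stored question (0.0 for an empty bank)."""
    return max((cosine(embedding, m) for m in memory_bank), default=0.0)


if __name__ == "__main__":
    batch = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(batch_penalty(batch, 0))                 # one duplicate, one unrelated question
    print(memory_penalty([1.0, 0.0], [[0.0, 1.0]]))  # nothing similar in the bank
```

A question that duplicates another in its batch, or one already stored in the Memory Bank, receives a penalty close to 1 under these assumptions.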

### `memory_bank/update_memory.py`

Updates the Memory Bank:
- Supports NL (natural language) and Code (Python code) embedding modes
- Runs code generation in parallel across multiple GPUs
- Applies a FIFO policy to cap the size of the historical question bank
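
A FIFO-bounded bank can be sketched with `collections.deque`. The class below is a hypothetical stand-in for illustration, not the structure used by `update_memory.py`; the `max_size` parameter and the pairing of each question with its embedding are assumptions.

```python
from collections import deque


class MemoryBank:
    """Keeps at most max_size (question, embedding) pairs, dropping the oldest first."""

    def __init__(self, max_size: int):
        # A deque with maxlen discards items from the left once it is full.
        self.entries = deque(maxlen=max_size)

    def add(self, questions, embeddings):
        """Append new questions with their embeddings; the oldest entries fall out."""
        if len(questions) != len(embeddings):
            raise ValueError("questions and embeddings must have the same length")
        self.entries.extend(zip(questions, embeddings))

    def questions(self):
        return [q for q, _ in self.entries]


if __name__ == "__main__":
    bank = MemoryBank(max_size=3)
    bank.add(["q1", "q2"], [[0.1], [0.2]])
    bank.add(["q3", "q4"], [[0.3], [0.4]])
    print(bank.questions())  # the oldest question has been evicted
```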

## Quick Start

### 1. Environment Setup

```bash
# Install dependencies
pip install -r requirements.txt

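# Create tokens.json with your credentials
# (the quoted 'EOF' keeps the shell from expanding any $ characters in the tokens)
cat > tokens.json << 'EOF'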
{
    "huggingface": "your_hf_token",
    "wandb": "your_wandb_key"
}
EOF
```

### 2. Configure Storage Path

```bash
# Set these in scripts/main.sh
export STORAGE_PATH="/path/to/your/storage"
export HUGGINGFACENAME="your_hf_username"
```

### 3. Start Training

```bash
# Full training (5 rounds of self-evolution)
bash scripts/main.sh /path/to/base_model model_name

# Train Questioner only
bash scripts/questioner_train_penalty.sh \
    /path/to/solver_model \
    /path/to/questioner_model \
    experiment_name \
    1

# Train Solver only
bash scripts/solver_train.sh \
    /path/to/solver_model \
    /path/to/questioner_model \
    experiment_name \
    1
```

## Configuration

### GPU Time-Sharing Mode

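R-Diverse runs in a GPU time-sharing mode in which the vLLM services and the Trainer take turns on the same GPUs. vLLM sleeps (freeing GPU memory) while the Trainer runs, then wakes (reloading weights) for inference: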

```
Timeline:
┌────────────────────────────────────────────────────────────┐
│  Trainer Running │  vLLM Wake │  vLLM Running │ vLLM Sleep │
│  (GRPO Update)   │            │  (Inference)  │            │
│  ████████████    │  ▓▓▓▓     │  ████████     │  ▓▓▓▓      │
│  Using GPU       │  Load Weights │  Using GPU │  Free Memory│
└────────────────────────────────────────────────────────────┘
```
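
The wake, infer, sleep cycle in the timeline maps naturally onto a context manager. In the sketch below, `wake` and `sleep` are hypothetical callables standing in for whatever mechanism the vLLM services expose; they are not vLLM API calls, and the helper `vllm_awake` is not part of this repository.

```python
from contextlib import contextmanager


@contextmanager
def vllm_awake(wake, sleep):
    """Wake the inference service for the duration of the block, then put it back to sleep."""
    wake()       # reload weights onto the GPU
    try:
        yield
    finally:
        sleep()  # free GPU memory for the Trainer, even if inference raised


if __name__ == "__main__":
    events = []
    with vllm_awake(lambda: events.append("wake"), lambda: events.append("sleep")):
        events.append("inference")
    events.append("trainer update")
    print(events)
```

Using `try`/`finally` means the GPU memory is released even when an inference call fails, so the Trainer is not left waiting on a service that is still holding memory.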

### Embedding Modes

| Mode | Description | Use Case |
|------|-------------|----------|
| `nl` | Natural Language Embedding (BGE) | Text-based question similarity |
| `code` | Code Embedding (Jina) | Solution code-based similarity |

### Experience Replay Strategies

| Strategy | Description |
|----------|-------------|
| `none` | No experience replay |
| `pre_eval` | Mix historical data in before the Evaluate step |
| `post_eval` | Mix historical data in after the Evaluate step, just before training |
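
The difference between the strategies is where historical data joins the pipeline. The sketch below illustrates this under stated assumptions: `prepare_training_data` is a hypothetical helper, and `evaluate` is a hypothetical callable that returns only the questions that pass evaluation. Whether the real Evaluate step filters data this way is not stated in this README.

```python
def prepare_training_data(new_questions, history, evaluate, strategy="post_eval"):
    """Return the Solver training set for the chosen replay strategy."""
    if strategy not in {"none", "pre_eval", "post_eval"}:
        raise ValueError(f"unknown replay strategy: {strategy}")
    if strategy == "pre_eval":
        # Historical data is mixed in first, so it is evaluated again.
        return evaluate(new_questions + history)
    kept = evaluate(new_questions)
    if strategy == "post_eval":
        # Historical data skips evaluation and is appended just before training.
        return kept + history
    return kept


if __name__ == "__main__":
    def keep_good(questions):
        # Toy evaluation: drop anything labelled "bad".
        return [q for q in questions if not q.startswith("bad")]

    new, old = ["a", "bad_new"], ["h", "bad_old"]
    for name in ("none", "pre_eval", "post_eval"):
        print(name, prepare_training_data(new, old, keep_good, name))
```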

## Directory Structure

Files generated during training:

```
$STORAGE_PATH/
├── models/                     # Model checkpoints
│   ├── {model_abbr}_questioner_v1/
│   │   └── global_step_5/actor/huggingface/
│   ├── {model_abbr}_solver_v1/
│   │   └── global_step_15/actor/huggingface/
│   └── ...
├── memory_bank/                # Memory Bank
│   └── {model_abbr}/
│       ├── questions.json      # Question list (NL mode)
│       ├── embeddings.npy      # Embedding vectors (NL mode)
│       ├── question_code.json  # Questions + code (Code mode)
│       └── embedding_code.npy  # Code embeddings (Code mode)
├── generated_question/         # Generated questions
└── temp_results/               # Temporary files
```
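
For scripts that need to locate a merged checkpoint, the layout above can be encoded in a small helper. This is a sketch based only on the tree shown here: `checkpoint_dir` is a hypothetical function, and it assumes the final step is always 5 for the Questioner and 15 for the Solver, matching the GRPO step counts in the training pipeline.

```python
from pathlib import PurePosixPath

# Final global step per role, taken from the directory tree above (assumed fixed).
FINAL_STEP = {"questioner": 5, "solver": 15}


def checkpoint_dir(storage_path, model_abbr, role, iteration):
    """Build the path to the HuggingFace-format weights for one role and round."""
    step = FINAL_STEP[role]  # raises KeyError for an unknown role
    return (
        PurePosixPath(storage_path)
        / "models"
        / f"{model_abbr}_{role}_v{iteration}"
        / f"global_step_{step}"
        / "actor"
        / "huggingface"
    )


if __name__ == "__main__":
    print(checkpoint_dir("/path/to/your/storage", "qwen7b_exp1", "solver", 2))
```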

## License

Apache License 2.0
