# Progressive Training System

## Overview

ProSetting is a progressive reinforcement learning training system that supports both the VERL PPO and TRL DPO frameworks. It trains Solver models iteratively over multiple rounds, transferring weights from one round to the next and augmenting the question data along the way. The architecture is fully modular, and training can run in either fully automated or semi-automated mode.

## Architecture

### Core Components
- **Solver Model**: Math problem-solving model supporting VERL PPO and TRL DPO training
- **Teacher Models**: Teacher1 (grading) and Teacher2 (question enhancement)
- **Training Frameworks**:
  - **VERL PPO**: Original reinforcement learning training engine
  - **TRL DPO**: Direct preference optimization framework (recommended)

### Training Flow

#### TRL DPO Flow (Recommended)
```
Questions → Solver → Teacher1 → DPO Triplets → Next Round
    ↓         ↓         ↓           ↓            ↓
Collection  Grading  Cartesian   Parquet     Weight Update
```

## Project Structure

```
ProSetting/
├── run_training.py           # Unified training launcher
├── auto_trainer.py           # Fully automated TRL trainer (recommended)
├── semi_auto_trainer_trl.py  # Semi-automated TRL trainer
├── collectors/               # Data collection modules
│   ├── trajectory_collector.py  # Trajectory collector
│   └── data_normalizer.py       # Data normalizer
├── processors/               # Data processing modules
│   ├── reward_calculator.py     # Reward calculator
│   ├── question_enhancer.py     # Question enhancer
│   └── solver_data_processor.py # Solver data processor
├── datasets/                 # Dataset modules
│   ├── dpo_data_converter.py    # DPO data converter
│   ├── dpo_data_generator.py    # DPO data generator
│   └── data_saver.py            # Data saver
├── trainers/                 # Training modules
│   ├── trl_trainer.py           # TRL trainer
│   └── gpu_manager.py           # GPU manager
├── managers/                 # Management modules
│   ├── round_controller.py      # Round controller
│   └── question_manager.py      # Question manager
├── core/                     # Core modules
│   └── state_manager.py         # State manager
├── Teacher_Model/           # Teacher model clients
├── utils/                   # Utility scripts
├── DataSet/                 # Dataset files
├── .env.example             # Environment configuration example
└── requirements.txt          # Project dependencies
```

## Quick Start

### Environment Setup

```bash
# 1. Install dependencies
pip install -r requirements.txt

# 2. Configure environment variables
cp .env.example .env
# Edit .env file with key parameters:
# SOLVER_MODEL_PATH=/path/to/solver/model
# QUESTIONS_FILE=/path/to/questions.json
# WORKSPACE_DIR=/path/to/workspace
# TRL_NUM_PROCESSES=8
# TEACHER_BASE_URL=http://your-teacher-api

# 3. Verify environment
python utils/status_checker.py --quick
```
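
If the launcher fails early, it can help to check which of the key variables are visible to the process. The following is a minimal sketch, not part of ProSetting: `missing_env_vars` and `KEY_VARS` are hypothetical names, and treating all five variables as required is an assumption based on the `.env` example above.

```python
import os

# Variable names copied from the .env example; treating all of them
# as required is an assumption made for this sketch.
KEY_VARS = (
    "SOLVER_MODEL_PATH",
    "QUESTIONS_FILE",
    "WORKSPACE_DIR",
    "TRL_NUM_PROCESSES",
    "TEACHER_BASE_URL",
)


def missing_env_vars(env=None):
    """Return the key variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in KEY_VARS if not env.get(name)]


# Demo with an explicit mapping instead of the real environment.
example_env = {
    "SOLVER_MODEL_PATH": "/path/to/solver/model",
    "WORKSPACE_DIR": "/path/to/workspace",
}
print(missing_env_vars(example_env))
# ['QUESTIONS_FILE', 'TRL_NUM_PROCESSES', 'TEACHER_BASE_URL']
```

Calling `missing_env_vars()` with no argument checks the current process environment instead.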

### Running Training

#### 1. Unified Launcher (Recommended)
```bash
cd /home/project/ProSetting

# Fully automated training (default)
python run_training.py

# Semi-automated training
python run_training.py --mode semi
```

#### 2. Direct Training Scripts
```bash
cd /home/project/ProSetting

# Fully automated TRL training (recommended)
python auto_trainer.py

# Semi-automated TRL training
python semi_auto_trainer_trl.py
```

#### 3. System Testing
```bash
# Quick system logic test
python utils/test_runner.py

# System status check
python utils/status_checker.py

# Quick status check
python utils/status_checker.py --quick
```

## Training Features

### Fully Automated TRL Training (Recommended)

- **Complete Automation**: No manual intervention required
- **Smart Retry**: Configurable retry mechanism with intervals
- **Error Recovery**: Option to skip or stop on training failures
- **Checkpoint Recovery**: Resume from any stage
- **Resource Management**: Automatic GPU memory cleanup
- **Detailed Logging**: Complete training process records and final reports
- **Signal Handling**: Graceful shutdown support (Ctrl+C)

### Semi-Automated TRL Training

TRL DPO training uses direct preference optimization, which is more stable than PPO:

- **Data Format**: Convert grading results to DPO triplets (prompt, chosen, rejected)
- **Training Method**: Distributed training using accelerate + TRL
- **Data Storage**: Parquet format for efficient reading
- **Weight Transfer**: Automatic model weight management between rounds
- **Training Schedule**: No training in the first 2 rounds (data accumulation only); training starts from round 3
- **Checkpoint Recovery**: Checkpoints at 5 fine-grained stages protect progress and allow recovery
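
The conversion to DPO triplets can be illustrated with a short sketch. It assumes that Teacher1's grading reduces to a correct/incorrect flag per attempt and that every correct answer is paired with every incorrect one (the "Cartesian" step in the flow diagram). `build_dpo_triplets` and the input shape are hypothetical, not the actual `dpo_data_converter.py` interface.

```python
from itertools import product


def build_dpo_triplets(prompt, graded_attempts):
    """Pair every passing answer with every failing answer for one prompt.

    graded_attempts: list of (answer_text, is_correct) tuples (assumed shape).
    """
    chosen = [answer for answer, ok in graded_attempts if ok]
    rejected = [answer for answer, ok in graded_attempts if not ok]
    # Cartesian product: len(chosen) * len(rejected) triplets. The result is
    # empty when all attempts passed or all attempts failed.
    return [
        {"prompt": prompt, "chosen": c, "rejected": r}
        for c, r in product(chosen, rejected)
    ]


attempts = [("x = 2", True), ("x = 3", False), ("x = -2", False)]
triplets = build_dpo_triplets("Solve 2x = 4.", attempts)
print(len(triplets))  # 2
```

Under this pairing, a question whose attempts are all correct or all incorrect contributes no triplets, which is one reason to sample several attempts per question.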

### Training Stages
1. **Data Collection**: Multi-GPU parallel solver trajectory collection
2. **Data Grading**: Teacher1 batch grading with 32 concurrent requests
3. **DPO Conversion**: Convert grading results to DPO triplets and save them as Parquet
4. **TRL Training**: Distributed training using accelerate + TRL (skipped in the first 2 rounds)
5. **Next Round Prep**: Build next round question pool with enhanced questions and failed replays
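
The five stages and their checkpoint recovery can be sketched as a simple ordered pipeline. This is an illustration under assumptions, not the project's `RoundController`: the stage identifiers, `remaining_stages`, and `run_round` are hypothetical names.

```python
# Hypothetical stage identifiers mirroring the five stages listed above.
STAGES = [
    "data_collection",
    "data_grading",
    "dpo_conversion",
    "trl_training",
    "next_round_prep",
]


def remaining_stages(last_completed=None):
    """Return the stages still to run, given the last completed stage."""
    if last_completed is None:
        return list(STAGES)
    return STAGES[STAGES.index(last_completed) + 1:]


def run_round(round_num, handlers, last_completed=None, first_training_round=3):
    """Run the remaining stages in order and return the names executed."""
    executed = []
    for stage in remaining_stages(last_completed):
        if stage == "trl_training" and round_num < first_training_round:
            continue  # early rounds only accumulate data
        handlers[stage](round_num)
        executed.append(stage)
    return executed


# Demo with no-op handlers that only record what was called.
log = []
handlers = {s: (lambda r, s=s: log.append((r, s))) for s in STAGES}
print(run_round(1, handlers))
print(run_round(3, handlers, last_completed="data_grading"))
```

The first call skips `trl_training` because round 1 is a data-accumulation round. The second call resumes round 3 after grading and runs only the last three stages.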

## Configuration

### Default Configuration
```python
{
    "max_rounds": 5,                    # Total training rounds
    "save_rounds": [3, 4, 5],          # Checkpoint save rounds
    "attempts_per_question": 8,         # Attempts per question
    "physical_solver_gpu": "4",         # Solver model GPU
    "physical_grpo_gpu": "0,1,2,3,4,5,6,7",  # Training GPUs
    "training_framework": "TRL_DPO",   # Training framework
    "trl_num_processes": 8,            # TRL training processes
    "trl_mixed_precision": "bf16"      # Mixed precision training
}
```
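
When overriding the defaults, a quick consistency check can catch mismatched values before a long run starts. The sketch below is hypothetical: `validate_config` is not part of ProSetting, and the rule that `trl_num_processes` should equal the number of training GPUs is an assumption based on how accelerate is typically launched with one process per GPU.

```python
DEFAULT_CONFIG = {
    "max_rounds": 5,
    "save_rounds": [3, 4, 5],
    "attempts_per_question": 8,
    "physical_solver_gpu": "4",
    "physical_grpo_gpu": "0,1,2,3,4,5,6,7",
    "training_framework": "TRL_DPO",
    "trl_num_processes": 8,
    "trl_mixed_precision": "bf16",
}


def validate_config(cfg):
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    if any(r < 1 or r > cfg["max_rounds"] for r in cfg["save_rounds"]):
        problems.append("save_rounds must lie within 1..max_rounds")
    n_training_gpus = len(cfg["physical_grpo_gpu"].split(","))
    # Assumption: one training process per GPU.
    if cfg["trl_num_processes"] != n_training_gpus:
        problems.append("trl_num_processes should match the number of training GPUs")
    return problems


print(validate_config(DEFAULT_CONFIG))  # []
```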

## Core Features

### 1. Inter-Round Weight Transfer
- Round 1 uses original weights
- Round 2 onward automatically loads the previous round's results
- Supports FSDP distributed weight auto-merging

### 2. Progressive Question Enhancement
- Teacher1 batch grading for reward calculation
- Teacher2 generates enhanced questions based on error analysis
- Failed question replay + random replay mechanism

### 3. Data Persistence
- All training data permanently saved
- Standardized file naming conventions
- Training state recovery support

### 4. Modular Architecture
- Separated data collection, processing, training, and management modules
- Independent testing and maintenance support
- Complete error handling mechanisms

## Training Strategy

### Round Configuration
- **Total Rounds**: 5 rounds (configurable)
- **Training Strategy**: The first 2 rounds only accumulate data; training starts from round 3
- **Weight Strategy**: Round 3, the first training round, starts from the original weights; later rounds start from the previous round's output
- **Save Strategy**: Automatic model weight saving after each training round
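
The round configuration above amounts to two decisions per round: whether to train, and which weights to start from. A minimal sketch, assuming trained outputs are tracked as a mapping from round number to a weights path (`plan_round` and that mapping are hypothetical):

```python
def plan_round(round_num, original_weights, trained_outputs, first_training_round=3):
    """Decide whether a round trains and which weights it starts from.

    trained_outputs maps round number -> saved weights path (assumed layout).
    """
    earlier = [r for r in trained_outputs if r < round_num]
    # Use the most recent trained output if any exists, else the originals.
    init = trained_outputs[max(earlier)] if earlier else original_weights
    return {"train": round_num >= first_training_round, "init_weights": init}


print(plan_round(2, "base", {}))
# {'train': False, 'init_weights': 'base'}
print(plan_round(3, "base", {}))
# {'train': True, 'init_weights': 'base'}
print(plan_round(4, "base", {3: "round_3_output"}))
# {'train': True, 'init_weights': 'round_3_output'}
```

With the default schedule, no trained output exists before round 3, so rounds 1 through 3 all start from the original weights.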

### Question Pool Management
- **Round 1**: Original questions (as provided in the data file)
- **Round 2+**: Enhanced questions + failed replays + random replays
- **Question Growth**: Question pool gradually expands each round
- **Enhancement Strategy**: Teacher2 generates enhanced questions based on error analysis, with 32 concurrent requests
- **Replay Mechanism**: 100% of failed questions are replayed; non-failed questions are also replayed in full
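
How the next round's pool might be assembled can be sketched as follows. This is an illustration only, not the `QuestionManager` implementation: `build_next_pool` is a hypothetical helper, questions are assumed to be plain strings, and the `passed_replay_fraction` parameter (defaulting to a full replay) is an assumption added to show how a random subset could be drawn.

```python
import random


def build_next_pool(enhanced, failed, passed, passed_replay_fraction=1.0, seed=0):
    """Combine enhanced questions, all failed questions, and replayed passes."""
    rng = random.Random(seed)
    k = round(len(passed) * passed_replay_fraction)
    replayed = rng.sample(passed, k)  # random subset; everything when k == len(passed)
    pool, seen = [], set()
    for question in [*enhanced, *failed, *replayed]:
        if question not in seen:  # drop duplicates, keep first occurrence
            seen.add(question)
            pool.append(question)
    return pool


pool = build_next_pool(
    enhanced=["q1 (harder)"],
    failed=["q1"],
    passed=["q2", "q3"],
)
print(len(pool))  # 4
```

Because enhanced questions are added on top of the replayed ones, the pool grows each round, matching the question growth described above.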

### GPU Allocation
- **Solver Collection**: Configurable GPU (default GPU 4)
- **TRL Training**: GPU 0-7 (8-card parallel using accelerate)
- **Memory Management**: Automatic cleanup and release between stages
- **Parallel Strategy**:
  - Data collection: Multi-GPU parallel with intelligent task allocation
  - Grading processing: 32 concurrent Teacher API calls
  - Question enhancement: 32 concurrent Teacher2 API calls
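
The 32-way concurrency for Teacher calls can be approximated with a bounded thread pool. This is a sketch under assumptions: whether ProSetting uses threads, processes, or async I/O is not stated here, and `grade_all` and `fake_grade` are hypothetical stand-ins for the real Teacher1 client.

```python
from concurrent.futures import ThreadPoolExecutor


def grade_all(items, grade_one, max_workers=32):
    """Apply grade_one to every item with at most max_workers calls in flight.

    Results are returned in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(grade_one, items))


def fake_grade(answer):
    # Stand-in for a Teacher1 API call; the real client would do network I/O.
    return {"answer": answer, "correct": answer.strip() == "4"}


results = grade_all(["4", "5", " 4 "], fake_grade)
print([r["correct"] for r in results])  # [True, False, True]
```

Threads suit this kind of workload because each call spends most of its time waiting on the network rather than using the CPU.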

## Troubleshooting

### Common Issues

1. **Model path not found**
   ```bash
   export SOLVER_MODEL_PATH="/correct/path/to/model"
   ```

2. **GPU memory insufficient**
   - Check GPU usage: `nvidia-smi`
   - Adjust `batch_size` or reduce parallelism

3. **Checkpoint merge failure**
   - Check checkpoint directory permissions
   - Confirm FSDP weight files are complete

4. **Training interruption recovery**
   ```bash
   # Fully automated training recovery
   python auto_trainer.py
   
   # Semi-automated training recovery
   python semi_auto_trainer_trl.py
   
   # Check recovery status
   python utils/status_checker.py
   ```

### Log Files
- **TRL Training Log**: `/tmp/trl_trainer.log`
- **Automated Training Log**: `/tmp/auto_trainer.log`
- **Training Output**: Real-time console output
- **State Files**: `{WORKSPACE_DIR}/training_state.json`
- **Round Progress**: `{WORKSPACE_DIR}/round_XX_progress.json`
- **Training Results**: `{WORKSPACE_DIR}/training_results/`
- **Training Summary**: `{WORKSPACE_DIR}/auto_training_summary.json`
- **Checkpoint Files**: `{WORKSPACE_DIR}/checkpoint_round_X.json`

## Development Guide

### Adding New Modules
1. Create new file in appropriate directory
2. Implement standard interfaces and error handling
3. Update corresponding `__init__.py` exports
4. Add unit tests

### Custom Training Strategies
1. Modify `RoundController` configuration
2. Adjust question pool building logic
3. Customize reward calculation functions

### Extending Data Formats
1. Update `StateManager` file naming
2. Modify data save and load logic
3. Ensure backward compatibility

## License

This project is released under an internal-use license and is intended for research and development only.

## Support

For questions or suggestions, please contact the development team or check project documentation.

---

**Note**: This system has undergone a modular refactoring that reorganizes the original `modules` directory into clearer functional modules:
- `collectors/` - Data collection
- `processors/` - Data processing
- `datasets/` - Dataset management
- `trainers/` - Training execution
- `managers/` - Round and question management
- `core/` - Core state management

This new directory structure is more intuitive and easier to understand and maintain.
