# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## Project Overview

MindSpeed RL is a reinforcement learning acceleration framework for the Huawei Ascend ecosystem. It provides end-to-end RL training and inference for Ascend partners. Its core acceleration capabilities include shared-card (co-located) or separate deployment of training and inference on very large Ascend clusters, asynchronous pipeline scheduling across multiple models, and communication between training and inference engines that use different partitioning.

## Common Commands

### Environment Setup
```bash
# Install dependencies
pip install -r requirements.txt

# For development, you may need to install additional dependencies
pip install antlr4-python3-runtime==4.7.2 --no-deps
```

### Running Training
```bash
# Start GRPO training with Qwen2.5-7B model
bash examples/grpo/grpo_trainer_qwen25_7b.sh

# Start DAPO training with Qwen2.5-32B model
bash examples/dapo/dapo_trainer_qwen25_32b.sh

# Start PPO training with Qwen2.5-32B model
bash examples/ppo/ppo_trainer_qwen25_32b.sh

# Start DPO training with Qwen3-30B-A3B model
bash examples/dpo/dpo_trainer_qwen3_30b_a3b.sh
```

### Data Preprocessing
```bash
# Preprocess Math-17k dataset
bash examples/data/preprocess_data.sh math_17k

# Preprocess DeepScaler dataset
bash examples/data/preprocess_data.sh deepscaler
```

### Running Tests
```bash
# Run system tests
bash tests/st/st_run.sh

# Run unit tests
# (Unit tests live in tests/ut/; the exact command depends on the test framework used)
```

## High-Level Architecture

The MindSpeed RL framework is organized into several key components:

### Core Modules

1. **mindspeed_rl/** - Main package containing all RL functionality
   - **config_cls/** - Configuration classes for different components
   - **datasets/** - Dataset handling and preprocessing utilities
   - **models/** - Model implementations including actor, critic, reference, and reward models
     - **models/loss/** - Loss function implementations for different RL algorithms
   - **trainer/** - Training loop implementations for different RL algorithms
   - **utils/** - Utility functions and helper classes
   - **workers/** - Worker implementations for distributed training

### Training Algorithms

1. **GRPO (Group Relative Policy Optimization)** - Primary algorithm with integrated worker support
2. **DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization)** - GRPO variant that decouples the lower and upper clipping bounds and adds dynamic sampling
3. **PPO (Proximal Policy Optimization)** - Standard PPO implementation
4. **DPO (Direct Preference Optimization)** - Direct preference optimization
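
To make the "group relative" part of GRPO concrete, here is a minimal sketch of how advantages can be derived from the rewards of several responses sampled for one prompt. The function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions for illustration, not the code in `mindspeed_rl/`.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples of the same prompt.

    Hypothetical helper: subtract the group mean and divide by the group
    standard deviation (eps avoids division by zero when all rewards match).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses for one prompt, scored 1.0 (correct) or 0.0 (wrong).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response gets the same reward carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Because the baseline comes from the group itself, no separate critic model is needed in this scheme. That is one reason GRPO-style training depends on sampling several responses per prompt.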

### Key Features

1. **Integrated Worker** - Shared card deployment where Actor, Reference, and other workers time-share the same machine resources
2. **Data Module** - Centralized data management system connecting inference and training frameworks
3. **Resharding** - Re-partitioning of model weights between the parallelization strategies used for training and for inference
4. **Remove Padding** - Strips padding tokens from batched sequences so that compute is not spent on them
5. **Context Parallel** - Long sequence parallelization support
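
As an illustration of the idea behind Remove Padding, the sketch below packs a right-padded batch into one flat token list plus cumulative sequence offsets, and then restores the individual sequences. The helper names `pack_sequences` and `unpack_sequences` and the `cu_seqlens` layout are assumptions chosen for the example. The framework's own implementation works on tensors and may differ.

```python
def pack_sequences(padded_batch, lengths):
    """Concatenate the real tokens of each row and record cumulative offsets."""
    packed, cu_seqlens = [], [0]
    for row, n in zip(padded_batch, lengths):
        packed.extend(row[:n])                 # keep only the first n real tokens
        cu_seqlens.append(cu_seqlens[-1] + n)  # offset where the next sequence starts
    return packed, cu_seqlens


def unpack_sequences(packed, cu_seqlens):
    """Split the flat token list back into per-sample sequences."""
    return [packed[s:e] for s, e in zip(cu_seqlens, cu_seqlens[1:])]


# Three right-padded sequences (0 is the pad token in this toy example).
padded = [[5, 6, 7, 0, 0], [8, 9, 0, 0, 0], [1, 2, 3, 4, 5]]
packed, cu_seqlens = pack_sequences(padded, lengths=[3, 2, 5])
restored = unpack_sequences(packed, cu_seqlens)
```

In this toy batch, 10 of the 15 padded positions hold real tokens, so a third of the positions would otherwise be wasted on padding.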

### Configuration System

The framework uses Hydra for configuration management with YAML files organized in the `configs/` directory:
- **configs/model/** - Model architecture configurations
- **configs/datasets/** - Dataset preprocessing configurations
- **configs/envs/** - Environment variable configurations
- **configs/checkpoint/** - Checkpoint configurations

### Worker Architecture

The framework implements a worker-based architecture where different components run as separate workers:
- **ActorHybridWorker** - Handles generation, log probability computation, and model updates
- **ReferenceWorker** - Computes reference log probabilities
- **RewardWorker** - Computes rewards (when using reward models)
- **RuleReward** - Computes rewards using rule-based verifiers
- **IntegratedWorker** - Combined worker for shared-card (co-located) deployment
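
A rule-based verifier of the kind `RuleReward` relies on can be as simple as extracting a final answer and comparing it with the ground truth. The sketch below is a hypothetical example for math-style data. The `\boxed{...}` answer convention, the function names, and the binary 1.0/0.0 scoring are all assumptions, not the repository's actual verifier.

```python
import re


def extract_boxed(text):
    """Return the content of the last boxed answer in the text, or None."""
    # Matches a literal \boxed{...} whose content has no nested braces.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def rule_reward(response, ground_truth):
    """Hypothetical verifier: 1.0 for an exact answer match, otherwise 0.0."""
    answer = extract_boxed(response)
    if answer is not None and answer == ground_truth.strip():
        return 1.0
    return 0.0


correct = rule_reward(r"The sum is 40 + 2, so the answer is \boxed{42}.", "42")
wrong = rule_reward(r"I believe the answer is \boxed{41}.", "42")
unformatted = rule_reward("The answer is 42.", "42")
```

A response that states the right value without the expected format scores zero here, which is a common way for rule-based rewards to enforce output structure.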

## Key Configuration Parameters

### RL Configuration (rl_config)
- `use_integrated_worker` - Enable shared card deployment
- `n_samples_per_prompt` - Number of responses sampled for each prompt
- `max_prompt_length` - Maximum prompt length for training
- `mini_batch_size` - Mini-batch size for actor updates
- `clip_ratio` - Clipping ratio for policy updates
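
To show what a clipping ratio typically controls, here is a minimal per-token sketch of the standard PPO clipped surrogate objective. It assumes `clip_ratio` plays the role of the usual epsilon. The function name and the scalar formulation are illustrative, and the framework's loss in `models/loss/` may differ in detail.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantage, clip_ratio=0.2):
    """Standard PPO clipped objective for one token (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new / pi_old
    clipped = max(1.0 - clip_ratio, min(ratio, 1.0 + clip_ratio))
    # Take the more pessimistic of the unclipped and clipped estimates.
    return min(ratio * advantage, clipped * advantage)


# The policy doubled the token's probability and the advantage is positive:
# the objective is capped at (1 + clip_ratio) * advantage.
capped = clipped_surrogate(math.log(2.0), 0.0, advantage=1.0)

# Same ratio with a negative advantage: the unclipped value is lower, so it is kept.
penalized = clipped_surrogate(math.log(2.0), 0.0, advantage=-1.0)
```

A smaller `clip_ratio` keeps each update closer to the policy that generated the samples, at the cost of slower progress per mini-batch.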

### Generation Configuration (generate_config)
- `infer_tensor_parallel_size` - Tensor parallelism for inference
- `max_tokens` - Maximum tokens for generation
- `temperature` - Sampling temperature
- `top_p` - Top-p sampling parameter
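
The interaction between `temperature` and `top_p` can be sketched in plain Python. This is a generic illustration of temperature scaling followed by nucleus (top-p) filtering, not the inference engine's implementation. The function names are hypothetical and `temperature` is assumed to be greater than zero.

```python
import math
import random


def top_p_filter(logits, temperature=1.0, top_p=1.0):
    """Return {token_index: probability} restricted to the top-p nucleus."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the most likely tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalize over the nucleus


def sample_token(logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw one token index from the filtered distribution."""
    dist = top_p_filter(logits, temperature, top_p)
    tokens = list(dist)
    return rng.choices(tokens, weights=[dist[t] for t in tokens], k=1)[0]


narrow = top_p_filter([2.0, 1.0, 0.0], temperature=1.0, top_p=0.5)
wide = top_p_filter([2.0, 1.0, 0.0], temperature=1.0, top_p=1.0)
token = sample_token([2.0, 1.0, 0.0], temperature=1.0, top_p=0.5)
```

Lower temperatures sharpen the distribution before filtering, so the same `top_p` keeps fewer tokens. For RL rollouts this trades response diversity against sample quality.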

### Environment Variables (configs/envs/runtime_env.yaml)
- `VLLM_DP_SIZE` - vLLM data parallelism size
- `HCCL_SOCKET_IFNAME` - HCCL communication interface
- `TASK_QUEUE_ENABLE` - Task queue optimization level

## Testing Structure

Tests are organized in the `tests/` directory:
- **tests/st/** - System tests with end-to-end training scenarios
- **tests/ut/** - Unit tests for individual components
- **tests/configs/** - Test-specific configuration files

System tests typically run complete training workflows to verify functionality.