# CAPO

This repository contains the implementation of the paper **"CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment"**.


---

## 🛠️ Installation

### 1. Environment Setup
First, create a fresh Conda environment and install the core dependencies following the `verl` installation guidelines.

```bash
# Create and activate conda environment
conda create -n verl python==3.12
conda activate verl

# Install vLLM and SGLang dependencies (USE_MEGATRON=0 skips the Megatron install)
USE_MEGATRON=0 bash scripts/install_vllm_sglang_mcore.sh

# Install the current package in editable mode
cd verl
pip install --no-deps -e .
```

### 2. Install Verification Tools
We use `math-verify` to enable rule-based verification for mathematical reasoning tasks.

```bash
pip install "math-verify[antlr4_9_3]==0.8.0"
```
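
As a quick sanity check after installation, the sketch below reports the installed `math-verify` version and, if the package imports, compares two equivalent answers using its `parse` and `verify` functions. The helper name `installed_version` is illustrative and not part of this repository.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("math-verify")
    print(f"math-verify version: {version}")  # this README pins 0.8.0

    if version is not None:
        from math_verify import parse, verify

        # Parse a reference answer and a candidate answer from LaTeX strings.
        gold = parse("$\\frac{1}{2}$")
        answer = parse("$0.5$")
        # verify(gold, answer) reports whether the candidate matches the reference.
        print("match:", verify(gold, answer))
```

If the version prints as `None`, the package is not visible in the active environment; check that `conda activate verl` was run before `pip install`.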

---

## 🚀 Cluster Setup & Usage

This project uses a **Ray cluster** for distributed training and **vLLM** to serve the Generative PRM (Qwen2.5-14B-Instruct or Qwen2.5-32B-Instruct).

### Step 1: Start Ray Cluster

You need to start a Head Node and (optionally) Worker Nodes.

**On the Head Node:**
Replace `$PORT` with your desired port (e.g., `6379`).
```bash
ray start --head \
    --port=$PORT \
    --min-worker-port=20122 \
    --max-worker-port=20999
```

**On Worker Nodes (if using multi-node training):**
Connect to the head node using its IP address.
```bash
# Replace $HEAD_IP and $PORT with the actual IP and port of the Head Node
ray start --address=$HEAD_IP:$PORT \
    --min-worker-port=20122 \
    --max-worker-port=20999
```
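
If a worker fails to join, it can help to first confirm that the head node's port is reachable from that worker. The following is a minimal sketch using only the standard library; the function name `can_connect` is hypothetical and not part of Ray or this repository. It only tests a plain TCP connection, so it does not confirm that a Ray head process is actually listening.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For example, `can_connect("10.0.0.1", 6379)` (with the placeholder IP replaced by your `$HEAD_IP` and the port by your `$PORT`) should return `True` from every worker node before you run `ray start --address=...` there. The worker port range `20122-20999` set above presumably also needs to be open between nodes.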

### Step 2: Start vLLM Serving

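We use vLLM to serve the Generative PRM or policy models. Run the following command on **each node** dedicated to inference (for example, on all 4 nodes if you have 4 inference nodes). The command runs in the background and discards all output; redirect to a log file instead of `/dev/null` if you need to debug startup.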

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup vllm serve \
    --config verl/recipe/prm/config_vllm.yaml \
    > /dev/null 2>&1 &
```
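
Because the server starts in the background with its output discarded, it is useful to confirm that each node is actually serving before moving on. The sketch below queries the OpenAI-compatible `/v1/models` route that vLLM exposes and returns the model IDs it reports. The function name `list_served_models` is hypothetical, and the base URL you pass depends on the host and port set in your `config_vllm.yaml`.

```python
import json
import urllib.request


def list_served_models(base_url: str, timeout: float = 5.0) -> list:
    """Return model IDs reported by an OpenAI-compatible `/v1/models` endpoint.

    Returns an empty list if the server is unreachable or the reply is not valid JSON.
    """
    url = base_url.rstrip("/") + "/v1/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers connection errors, HTTP errors, and timeouts;
        # ValueError covers malformed JSON.
        return []
    return [model["id"] for model in payload.get("data", [])]
```

For instance, `list_served_models("http://localhost:8000")` assumes vLLM's default port of 8000; adjust it if your configuration uses a different one. An empty list usually means the server is still loading weights or failed to start.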

### Step 3: Update Configuration

After starting the vLLM services, you must manually register the node addresses.

1.  Get the IP address/URL of the nodes where you deployed vLLM.
2.  Open the file: `recipe/genrm_remote/reward_function_batch.py`.
3.  Update the address list within the file to match your deployed node endpoints.
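
The exact layout of the address list in `reward_function_batch.py` is not reproduced here, so check the file itself for the variable to edit. Purely as an illustration of the idea, the sketch below builds endpoint URLs from node IPs and rotates through them. All names (`VLLM_NODE_IPS`, `VLLM_PORT`, `build_endpoints`, `round_robin`), the IPs, and the port are assumptions, and the repository may distribute requests differently.

```python
from itertools import cycle

# Hypothetical values: replace with the IPs of your vLLM nodes and the port they serve on.
VLLM_NODE_IPS = ["10.0.0.11", "10.0.0.12"]
VLLM_PORT = 8000


def build_endpoints(ips, port):
    """Turn node IPs into OpenAI-compatible base URLs, e.g. http://10.0.0.11:8000/v1."""
    return [f"http://{ip}:{port}/v1" for ip in ips]


def round_robin(endpoints):
    """Return an endless iterator that cycles through the endpoints in order."""
    return cycle(endpoints)


endpoints = build_endpoints(VLLM_NODE_IPS, VLLM_PORT)
picker = round_robin(endpoints)
# Each call to next(picker) hands back the next node's URL, wrapping around at the end.
first_three = [next(picker) for _ in range(3)]
```

Whatever form the real list takes, every entry should point at a node where the Step 2 command is running.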

### Step 4: Run Training Example

We provide example configurations for Qwen2.5-Math-1.5B and Qwen2.5-Math-7B. Run the command for the model you want to train.

```bash
# Qwen2.5-Math-1.5B
python3 -m recipe.prm.main_prm --config-name CAPO_Qwen1_5B

# Qwen2.5-Math-7B
python3 -m recipe.prm.main_prm --config-name CAPO_Qwen7B
```

