# Installation

The headline experiments use vLLM 0.12.0 + PyTorch 2.9.0 + VERL 0.7.0 (PyPI)
+ FlashAttention 2.8.3 inside a Python 3.12 venv. The setup script
`scripts/setup_vllm012_venv.sh` handles the install and applies a small patch
to vLLM's LoRA Triton kernel that is needed on B200 / SM100a: it disables PDL
(programmatic dependent launch) gating to avoid `gdc_wait()` codegen failures
on Blackwell.

## Hardware

| Stage | Recommended | Minimum |
|---|---|---|
| SFT (Stage 1) | 4 × B200 80 GB or 4 × H200 144 GB | 4 × A100 80 GB (slower; will require lower micro-batch) |
| Vanilla GSPO (Stage 2a) | 8 × B200 80 GB | 8 × H200 144 GB |
| Novelty GSPO (Stage 2b) | 4 × B200 80 GB | 4 × H200 144 GB |
| Eval (pass@32) | 4–8 GPUs (data-parallel via vLLM); CPU-only `compute_pass_at_k.py` works on any machine | 1 GPU (smaller `--limit`) |

## Step 1: clone & install

```bash
git clone <anon-url> .
bash scripts/setup_vllm012_venv.sh        # default venv path: $HOME/verl-vllm012
source $HOME/verl-vllm012/bin/activate
export VLLM_VENV_PATH=$HOME/verl-vllm012  # picked up by the eval shell drivers
```

Python ≥ 3.10 is required; the script auto-detects a `uv`-installed Python
3.12 if present. Disk requirement: ~12 GB for the venv + 30 GB scratch for
vLLM/triton caches. First-run wheel downloads can take 10–20 min.
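
Before running the setup script, the Python and disk requirements above can be checked with a small preflight script. This is a minimal sketch, not part of the repo: the function name `preflight_problems` is hypothetical, and it assumes the venv and the vLLM/triton caches both live under `$HOME`.

```python
import shutil
import sys
from pathlib import Path

MIN_PYTHON = (3, 10)        # minimum Python version stated above
REQUIRED_FREE_GB = 12 + 30  # ~12 GB venv + 30 GB cache scratch


def preflight_problems(python_version, free_gb):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if tuple(python_version[:2]) < MIN_PYTHON:
        problems.append(
            f"Python {python_version[0]}.{python_version[1]} is older than "
            f"{MIN_PYTHON[0]}.{MIN_PYTHON[1]}"
        )
    if free_gb < REQUIRED_FREE_GB:
        problems.append(
            f"only {free_gb:.0f} GB free, need about {REQUIRED_FREE_GB} GB"
        )
    return problems


if __name__ == "__main__":
    # Assumption: venv and caches are both created under the home directory.
    free_gb = shutil.disk_usage(Path.home()).free / 1e9
    for problem in preflight_problems(sys.version_info, free_gb):
        print("WARNING:", problem)
```

The script only prints warnings; it does not block the install, since caches may be redirected to another volume.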

The setup script installs:

| Package | Pinned version | Notes |
|---|---|---|
| `verl` | 0.7.0 | from PyPI (no source build needed) |
| `vllm` | 0.12.0 | with PDL patch for SM100a |
| `torch` | 2.9.0+cu129 | matches vLLM 0.12.0 wheels |
| `flash-attn` | 2.8.3 | FA2 (FA3 not required) |
| `flashinfer-python` | ≥ 0.6.2 | precompiled SM90a kernels; SM100a JIT |
| `transformers` | 4.56.1 | |
| `peft` | 0.17.1 | LoRA |
| `ray` | 2.49.1 | distributed |
| `wandb` | 0.24.0 | optional |

The legacy stack documented in `requirements-verl.txt` (vLLM 0.10.1, PyTorch
2.7.1+cu128) is retained for reference but **must not** be used on
B200/SM100a — it produces 3× higher grad_norm during GSPO due to a vLLM
rollout divergence on Blackwell.

## Step 2: GPU-vendor environment variables

The training scripts set the following automatically; if you launch the
Python entry points directly, copy them yourself:

```bash
export VLLM_USE_V1=1                       # required by VERL ≥ 0.7
export VLLM_USE_TRTLLM_ATTENTION=0         # TRTLLM crashes on B200
export VLLM_ATTENTION_BACKEND=FLASH_ATTN   # only working backend on B200
export TORCHDYNAMO_DISABLE=1               # avoids triton autotuner CUDA issues
export PYTHONUNBUFFERED=1                  # flush logs immediately
export WANDB_CONSOLE=off                   # avoid stdout/stderr hijacking
unset PYTORCH_CUDA_ALLOC_CONF              # conflicts with vLLM V1 CuMemAllocator
```

H200 / SM90a does not need the `VLLM_USE_TRTLLM_ATTENTION` and
`VLLM_ATTENTION_BACKEND` overrides, but they are harmless when set.
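
When launching a Python entry point from another Python process instead of a shell, the same settings can be applied to a copy of the environment. This is a minimal sketch; `training_env` is a hypothetical helper, and the values are copied from the shell snippet above.

```python
import os

# Values copied from the shell snippet above.
REQUIRED_ENV = {
    "VLLM_USE_V1": "1",
    "VLLM_USE_TRTLLM_ATTENTION": "0",
    "VLLM_ATTENTION_BACKEND": "FLASH_ATTN",
    "TORCHDYNAMO_DISABLE": "1",
    "PYTHONUNBUFFERED": "1",
    "WANDB_CONSOLE": "off",
}
# Variables that must be absent (the shell snippet uses `unset`).
UNSET_ENV = ("PYTORCH_CUDA_ALLOC_CONF",)


def training_env(base=None):
    """Return a copy of `base` (default: os.environ) with the overrides applied."""
    env = dict(os.environ if base is None else base)
    env.update(REQUIRED_ENV)
    for name in UNSET_ENV:
        env.pop(name, None)
    return env
```

The returned mapping can be passed as `env=` to `subprocess.run`, which leaves the parent process's own environment untouched.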

## Step 3: secrets

The training scripts read `HF_TOKEN` and `WANDB_API_KEY` from the
environment. Neither is stored in this repo.

```bash
export HF_TOKEN=hf_...           # for OLMo-3 base + the anon-neurips26/* assets
export WANDB_API_KEY=...         # optional; omit and set WANDB_MODE=disabled
export WANDB_MODE=online         # or offline / disabled
```
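
A launcher can check for these variables up front and fall back to `WANDB_MODE=disabled` when no W&B key is present. The sketch below is an assumption about how one might do this, not the behaviour of the training scripts; `check_secrets` is a hypothetical name, and the function never prints secret values.

```python
import os


def check_secrets(env):
    """Return (updated_env, messages) without exposing any secret values."""
    env = dict(env)
    messages = []
    if not env.get("HF_TOKEN"):
        messages.append("HF_TOKEN is not set; downloads that need it may fail")
    # Only fall back when W&B would otherwise try to log online.
    if not env.get("WANDB_API_KEY") and env.get("WANDB_MODE", "online") == "online":
        env["WANDB_MODE"] = "disabled"
        messages.append("WANDB_API_KEY is not set; using WANDB_MODE=disabled")
    return env, messages


if __name__ == "__main__":
    _, notes = check_secrets(os.environ)
    for note in notes:
        print(note)
```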

## Step 4: smoke check

```bash
python -c "import vllm, torch, verl, peft; \
print('vLLM', vllm.__version__, 'PyTorch', torch.__version__, 'CUDA', torch.version.cuda)"

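# Parse-only check. `bash -n` reads only its first file argument,
# so each script is checked in its own invocation.
for f in train/verl_sft/*.sh train/verl_grpo/*.sh; do
    bash -n "$f" || echo "syntax error in $f"
done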
```

Expected output:

```
vLLM 0.12.0 PyTorch 2.9.0+cu129 CUDA 12.9
```

If you hit `ModuleNotFoundError: No module named 'vllm'`, the venv is most
likely not activated. If you hit `gdc_wait()` errors at first kernel compile
on a B200, re-run `bash scripts/setup_vllm012_venv.sh`; the LoRA-PDL patch is
idempotent, so re-running is safe.
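
To tell whether the running interpreter belongs to the expected venv, compare `sys.prefix` with `sys.base_prefix` and with the venv path. This is a small sketch with hypothetical helper names; it assumes the default path `~/verl-vllm012` unless `VLLM_VENV_PATH` is set.

```python
import os
import sys
from pathlib import Path


def in_virtualenv(prefix, base_prefix):
    """A venv interpreter reports a prefix different from its base install."""
    return Path(prefix) != Path(base_prefix)


def matches_expected_venv(prefix, expected):
    """True if `prefix` is the venv directory named by `expected`."""
    return Path(prefix).resolve() == Path(expected).expanduser().resolve()


if __name__ == "__main__":
    expected = os.environ.get("VLLM_VENV_PATH", "~/verl-vllm012")
    print("inside a venv:", in_virtualenv(sys.prefix, sys.base_prefix))
    print("matches expected venv:", matches_expected_venv(sys.prefix, expected))
```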

## Step 5: verify the headline pipeline runs

Run a 1-step novelty-GSPO mini-run end-to-end (≈ 15 min on 4 × B200):

```bash
# Pull the SFT LoRA + merge in fp32 first (or train Stage 1 yourself).
huggingface-cli download anon-neurips26/olmo3-7b-puzzle-sft-ep5 \
    --local-dir checkpoints/sft_ep5
python src/verl_helpers/merge_lora.py \
    --base_model allenai/OLMo-3-7B-Instruct-SFT \
    --lora_path  checkpoints/sft_ep5 \
    --output_dir checkpoints/merged_ep5_fp32 \
    --torch_dtype float32

# Mini training-step smoke
TRAINER_TOTAL_TRAINING_STEPS=1 \
TRAIN_BATCH_SIZE=4 \
NUM_GENERATIONS=2 \
bash train/verl_grpo/novelty_production_gspo_28k_n4_galaxiesexact.sh
```

For a deeper sanity check, run a 4-prompt OlymMATH-Hard pass@32 eval
(≈ 8 min on 4 × B200):

```bash
python scripts/evals/lm_eval_dp_diverse.py \
    --model vllm \
    --model_args "pretrained=checkpoints/merged_ep5_fp32,data_parallel_size=4,gpu_memory_utilization=0.85,max_model_len=26000,trust_remote_code=True" \
    --include_path evaluate/custom_tasks \
    --tasks olymp_math_hard_pass32 \
    --apply_chat_template \
    --batch_size auto \
    --limit 4 \
    --seed 42 \
    --output_path results/_smoke \
    --log_samples
python scripts/evals/compute_pass_at_k.py \
    results/_smoke/<auto-generated-subdir> \
    --k_values 1,8,32 --workers 4
```
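
The name of the auto-generated subdirectory is not known in advance. One way to locate it is to pick the most recently modified directory under the output path, as sketched below. `newest_subdir` is a hypothetical helper, and whether `compute_pass_at_k.py` expects this directory or a file inside it is an assumption to verify against the script.

```python
from pathlib import Path


def newest_subdir(root):
    """Return the most recently modified directory directly under `root`."""
    candidates = [p for p in Path(root).iterdir() if p.is_dir()]
    if not candidates:
        raise FileNotFoundError(f"no subdirectories under {root}")
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

For example, `newest_subdir("results/_smoke")` returns the path to substitute for the placeholder after a fresh smoke run.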

> **`max_model_len` must exceed `max_gen_toks` from the YAML.** OlymMATH-Hard
> sets `max_gen_toks: 25000`, and the prompt has to fit in whatever context
> remains. If `max_model_len` is too small, lm_eval can truncate prompts to
> zero length, and vLLM 0.12.0 then rejects them with
> `ValueError: The decoder prompt cannot be empty`. Set `max_model_len`
> ≥ 26000, which leaves 1000 tokens for the prompt, or override `max_gen_toks`
> via `--gen_kwargs max_gen_toks=4096` for a fast smoke test. With
> `data_parallel_size=4`, each of the 4 GPUs holds a full model replica; a
> 7B model + 26k context fits in 80 GB per GPU.
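
The context arithmetic can be checked before launching an eval. The sketch below assumes the harness reserves `max_gen_toks` out of `max_model_len` and gives the prompt the remainder. The 1000-token default for `min_prompt_tokens` is an illustrative assumption derived from 26000 − 25000, not a documented limit.

```python
def prompt_budget(max_model_len, max_gen_toks):
    """Tokens left for the prompt once generation space is reserved."""
    return max_model_len - max_gen_toks


def check_context(max_model_len, max_gen_toks, min_prompt_tokens=1000):
    """True if at least `min_prompt_tokens` remain for the prompt."""
    return prompt_budget(max_model_len, max_gen_toks) >= min_prompt_tokens


if __name__ == "__main__":
    # Settings from the eval command above, then a deliberately bad one.
    for model_len, gen_toks in [(26000, 25000), (25000, 25000)]:
        status = "ok" if check_context(model_len, gen_toks) else "too small"
        print(model_len, gen_toks, prompt_budget(model_len, gen_toks), status)
```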

Verified locally on 4 × B200 with the SFT epoch-5 fp32-merged checkpoint:
4 problems × 32 rollouts in ~8 min, `compute_pass_at_k.py` parses without
error, pass@1/8/32 all reported.
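
For reference, pass@k is commonly computed with the unbiased estimator 1 − C(n − c, k) / C(n, k), where n is the number of rollouts per problem and c the number of correct ones. The sketch below implements that standard estimator; it is not taken from `compute_pass_at_k.py`, which may differ in detail.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over problems, given the correct count for each."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)
```

With 32 rollouts per problem, as in the smoke run above, `pass_at_k(32, c, 1)` reduces to the fraction of correct rollouts, `c / 32`.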
