# Supplementary code: *When RL Suppresses Its Own Vocabulary*

Anonymous code release for the NeurIPS 2026 submission **"When RL Suppresses
Its Own Vocabulary: Recovering Reasoning Diversity in Puzzle-to-Math
Transfer."** This repository covers the headline pipeline: SFT on
DSR-distilled puzzle traces, vanilla GSPO RL, and the novelty-bonus GSPO
variant that lifts OlymMATH-Hard pass@32 from 16.0% (base) to 36.0%.

The repository is **fully self-contained for code and data**: every parquet
needed to run Stage 1 SFT and Stage 2 RL ships inline under `data/`. The only
external download is the public OLMo-3-7B-Instruct-SFT base from
`allenai/OLMo-3-7B-Instruct-SFT`. **No model checkpoints are bundled** —
reproducing the headline pass@k numbers requires running SFT and RL from
scratch (~744 GPU-hours total on B200: 4×36 + 8×50 + 4×50, per the hardware note below).

> **Anonymity.** All identifiers, paths, and tokens have been redacted. The
> repository depends only on public HF resources (OLMo-3 base, public math
> benchmarks) — there is no anonymized HF org, no proprietary cluster
> dependency. Identifier sweeps for personal HF usernames, author surnames,
> institutional hints, S3 buckets, WandB entity strings, and literal API
> tokens all return zero hits at packaging time.

## Repository layout

```
.
├── INSTALL.md                     environment setup (uv + verl-vllm012 venv)
├── HF_ASSETS.md                   bundled-data manifest + public-HF deps
├── LICENSES.md                    third-party credits
├── train/
│   ├── verl_sft/multi_puzzle_dsr_olmo3_v2.sh
│   └── verl_grpo/
│       ├── multi_puzzle_gspo_olmo3_v2_sft_ep3.sh
│       └── novelty_production_gspo_28k_n4_galaxiesexact.sh
├── src/verl_helpers/              VERL entry points + LoRA / HF helpers
├── reward_function/               puzzle scorers (bridges, pattern, undead, galaxies) + format reward
├── prompts/                       puzzle prompt templates
├── scripts/
│   ├── evals/                     lm_eval_dp_diverse.py + compute_pass_at_k.py + drivers
│   ├── analysis/                  Section 4–9 figure scripts
│   └── setup_vllm012_venv.sh      builds the verl-vllm012 venv
├── analysis/exploration/          v90 primitive classifier framework + pipeline
├── evaluate/custom_tasks/         lm-eval task YAMLs (OlymMATH, AIME, HMMT, OMEGA, puzzles)
├── tools/eval_lora_checkpoints.py SFT-time per-checkpoint eval driver
└── data/
    ├── rl/  train_combined_v2_sft_ep3.parquet  Stage-2 RL corpus (9.2 MB)
    ├── sft/ <6 puzzle parquets>                Stage-1 SFT corpus (~70 MB)
    └── eval/<puzzle>/<size>_test200.parquet    held-out OOD puzzle evals
```

## Mapping paper claims to scripts

| Paper claim / figure | Script(s) |
|---|---|
| OlymMATH-Hard pass@32: base 16.0% (Table 1, Fig. 1) | `scripts/evals/eval_olmo3_base_puzzles_pass32.sh` (puzzle pass@32 baseline); `evaluate/custom_tasks/olymp_math/olymp_math_hard_pass32.yaml` for the math eval |
| SFT 23.0% | `train/verl_sft/multi_puzzle_dsr_olmo3_v2.sh` then evaluate |
| Vanilla GSPO 29.0% | `train/verl_grpo/multi_puzzle_gspo_olmo3_v2_sft_ep3.sh` |
| Novelty GSPO 36.0% | `train/verl_grpo/novelty_production_gspo_28k_n4_galaxiesexact.sh` (entry point: `src/verl_helpers/train_main_novelty.py`, top-k=100, α=0.1, z-clip=2) |
| Bridges 8×8 / Undead 5×5 / Pattern 5×5 grid extrapolation (Fig. 1 bottom) | `scripts/evals/eval_novelty_prod_s15_puzzles.sh`, task YAMLs under `evaluate/custom_tasks/{bridges,undead,pattern}_puzzle/` |
| HMMT, OMEGA pass@k (Fig. 1 top) | `scripts/evals/eval_hmmt_pass32.sh`, `eval_omega_pass32.sh` |
| AIME24 / AIME25 (Appx) | `scripts/evals/eval_aime_pass32.sh` |
| pass@k aggregation (any benchmark) | `scripts/evals/compute_pass_at_k.py --workers 8` |
| Sec 4 primitive classifier (v90) | `analysis/exploration/primitive_classification.py`; configs `analysis/exploration/configs/v90_*.yaml`; trained weights are not bundled (retrain via `analysis/exploration/llm_validation/classifier/`) |
| Sec 4–9 figures (PALETTE / EDGE) | `scripts/analysis/section6_7_figures.py`, `section6_motif_examples.py`, `section4_1_primitive_motif_distributions.py`, `plot_per_problem_progression.py`, `plot_puzzle_passk.py`, `recovery_k_sensitivity.py`, `novelty_signal_analysis.py`, `within_problem_paired.py` |
| Novelty bonus algorithm (Appx, Fig. alg:novelty) | `src/verl_helpers/train_main_novelty.py` (rollout-time top-k NLL → within-prompt z-score → α·z added to last-valid-token reward) |

## Reproduction recipe

> **Hardware.** SFT runs on 4 GPUs (B200 / H200 / A100 80 GB). Vanilla GSPO
> uses 8×B200; the novelty production run uses 4×B200 (each at the same
> per-GPU rollout load, n_gen=8 vs. n_gen=4). End-to-end wall-clock budget on
> the suggested hardware is ≈ 36 hr SFT + ≈ 50 hr per RL stage.

### 0. Set up the environment

```bash
bash scripts/setup_vllm012_venv.sh   # builds $HOME/verl-vllm012
source $HOME/verl-vllm012/bin/activate
export VLLM_VENV_PATH=$HOME/verl-vllm012   # used by every script in scripts/evals/
```

See [`INSTALL.md`](INSTALL.md) for full prerequisites. B200/SM100a GPUs need
the bundled vLLM LoRA-PDL patch (auto-applied by the training scripts and
the setup script).

### 1. Stage 1 — SFT on rejection-sampled DSR puzzle traces

```bash
bash train/verl_sft/multi_puzzle_dsr_olmo3_v2.sh
```

Trains a LoRA (rank 64, α 64, target=`all-linear`) on top of
`allenai/OLMo-3-7B-Instruct-SFT`. **Endpoint = epoch 5.** The training
corpus (~70 MB across six puzzle splits) is bundled in `data/sft/`. The
script reads those local parquets directly, so no HF dataset download is
needed.

### 2. Merge SFT epoch 5 in fp32 (critical for hard-math accuracy)

```bash
python src/verl_helpers/merge_lora.py \
    --base_model allenai/OLMo-3-7B-Instruct-SFT \
    --lora_path  checkpoints/sft_ep5 \
    --output_dir checkpoints/merged_ep5_fp32 \
    --torch_dtype float32
```

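A bf16 LoRA merge degrades AIME24 pass@1 (10% → 6.7%). Pass
`--torch_dtype float32` explicitly, because the default `torch_dtype=auto`
resolves to bf16. vLLM still down-casts the merged weights to bf16 at rollout
time, so the fp32 requirement concerns the merge arithmetic, not serving.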
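
The toy example below shows one generic way a low-precision merge can differ from a high-precision merge followed by a single cast: double rounding. It is a stdlib-only sketch, not the logic of `merge_lora.py`, and it does not claim to explain the reported AIME24 drop. The helpers `to_fp32` and `to_bf16` and the numeric values are made up for illustration.

```python
import struct

def to_fp32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (round-to-nearest-even).

    bfloat16 keeps the top 16 bits of the float32 bit pattern. NaN/inf are
    not handled; this is only meant for ordinary finite weights.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)   # bias implements ties-to-even
    bits &= 0xFFFF0000                    # drop the low 16 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# Toy base weight and LoRA update (hypothetical values).
w, delta = 0.5, 0.00196

# Merge entirely in bf16: the update is rounded, then the sum is rounded again.
merged_bf16 = to_bf16(to_bf16(w) + to_bf16(delta))

# Merge in fp32, then cast once to bf16 (as a bf16 inference engine would).
merged_fp32 = to_fp32(to_fp32(w) + to_fp32(delta))
served_from_fp32 = to_bf16(merged_fp32)

print(merged_bf16, served_from_fp32)
```

In this example the bf16 merge returns `0.5` (the update is lost), while the fp32 merge followed by one cast lands on the next bf16 value, `0.50390625`.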

### 3. Stage 2a — vanilla GSPO

```bash
bash train/verl_grpo/multi_puzzle_gspo_olmo3_v2_sft_ep3.sh
```

Runs on 8×B200 for 1 epoch (≈ 71 steps) with a batch of 128 prompts × 8
rollouts. The policy is initialised from the SFT-merged checkpoint, which
also serves as the KL reference (β = 1e-3).

### 4. Stage 2b — novelty-bonus GSPO

```bash
bash train/verl_grpo/novelty_production_gspo_28k_n4_galaxiesexact.sh
```

4×B200, 1 epoch, batch 128 × n_gen 4. The bonus is computed in
`src/verl_helpers/train_main_novelty.py`:

1. For each correct rollout, take the top-100 most surprising tokens under
   the frozen SFT reference and average their NLLs.
2. Within each prompt group, z-score the resulting per-rollout signal
   (clip to ±2).
3. Add α·z (α = 0.1) at the last valid token of each correct rollout
   *before* GSPO advantage computation.

Hyperparameters are exposed via env vars (`NOVELTY_ALPHA`, `NOVELTY_TOPK`,
`NOVELTY_Z_CLIP`, `NOVELTY_USE_SUM`, `NOVELTY_ALPHA_DECAY`). All other GSPO
machinery is unchanged.
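
The three steps above can be sketched in plain Python as follows. This is a minimal illustration, not the code in `train_main_novelty.py`: the function names are hypothetical, and several details are assumptions (statistics taken over correct rollouts only, population standard deviation, a small `eps`, and a zero bonus when a group has fewer than two correct rollouts). `NOVELTY_USE_SUM` and `NOVELTY_ALPHA_DECAY` are not modelled.

```python
from statistics import mean, pstdev

def novelty_signal(token_nlls, top_k=100):
    """Step 1: mean NLL of the top_k most surprising tokens of one rollout."""
    top = sorted(token_nlls, reverse=True)[:top_k]
    return mean(top)

def novelty_bonuses(group_nlls, group_correct, alpha=0.1, top_k=100,
                    z_clip=2.0, eps=1e-6):
    """Steps 2-3: per-rollout bonus alpha * clipped z-score for one prompt group.

    group_nlls[i] holds per-token NLLs of rollout i under the reference model;
    group_correct[i] says whether rollout i was correct. Incorrect rollouts
    get a bonus of 0 (assumption: they are also excluded from the statistics).
    """
    signals = {
        i: novelty_signal(nlls, top_k)
        for i, (nlls, ok) in enumerate(zip(group_nlls, group_correct))
        if ok and nlls
    }
    bonuses = [0.0] * len(group_nlls)
    if len(signals) < 2:          # a z-score needs at least two samples
        return bonuses
    values = list(signals.values())
    mu, sigma = mean(values), pstdev(values)
    for i, s in signals.items():
        z = (s - mu) / (sigma + eps)
        z = max(-z_clip, min(z_clip, z))   # clip to +/- z_clip
        bonuses[i] = alpha * z
    return bonuses

def add_bonus_at_last_valid_token(token_rewards, response_mask, bonus):
    """Add the bonus to the reward at the last position where the mask is 1."""
    last = max(i for i, m in enumerate(response_mask) if m)
    out = list(token_rewards)
    out[last] += bonus
    return out

# Toy group: three rollouts, the third one incorrect.
group_nlls = [[1.0, 3.0], [2.0, 6.0], [9.0]]
group_correct = [True, True, False]
bonuses = novelty_bonuses(group_nlls, group_correct, top_k=1)
rewards = add_bonus_at_last_valid_token([0.0, 0.0, 1.0, 0.0], [1, 1, 1, 0], bonuses[1])
print(bonuses, rewards)
```

With two correct rollouts the z-scores are symmetric, so the more surprising rollout gains roughly `+α` and the other loses roughly `α`, while the incorrect rollout is untouched.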

### 5. Evaluation

The wrapper `scripts/evals/lm_eval_dp_diverse.py` is **mandatory** for any
sampled (pass@k) eval — it diversifies the per-prompt vLLM seed so the
`repeats > 1` rollouts are not duplicates. Greedy evals can use raw
`lm_eval` directly.

```bash
# OlymMATH-Hard pass@32 against the novelty production checkpoint at step 15
bash scripts/evals/eval_novelty_prod_s15_olymp.sh
python scripts/evals/compute_pass_at_k.py \
    results/novelty_prod_s15_math_eval_diverse/novelty_s15/ \
    --k_values 1,8,32 --workers 8
```
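
The seed diversification mentioned above can be pictured with the sketch below. It is a hypothetical scheme, not the implementation in `lm_eval_dp_diverse.py`: the function name `rollout_seed`, the hashing approach, and the 32-bit seed width are all assumptions. The idea is only that each (prompt, repeat) pair gets its own reproducible sampling seed.

```python
import hashlib

def rollout_seed(base_seed: int, prompt_id: str, repeat_idx: int) -> int:
    """Derive a reproducible 32-bit seed for one (prompt, repeat) pair.

    Hashing the triple keeps seeds stable across runs while making
    repeated rollouts of the same prompt sample differently.
    """
    key = f"{base_seed}:{prompt_id}:{repeat_idx}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big")

# 32 repeats of one prompt, as in a pass@32 eval (prompt id is made up).
seeds = [rollout_seed(1234, "olymmath_hard_007", r) for r in range(32)]
print(seeds[:4])
```

Each derived seed would then be passed to the sampler for the corresponding rollout. Reusing one seed for all repeats is the failure mode the wrapper is described as preventing.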

`compute_pass_at_k.py` scores math tasks with `math_verify` and puzzle tasks
with the reward functions under `reward_function/`. The `--workers 8` flag
yields a ~4× speedup on pass@64 scoring.
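
For readers unfamiliar with pass@k, the sketch below shows the standard unbiased estimator, 1 − C(n−c, k)/C(n, k) for n samples with c correct. This README does not state which estimator `compute_pass_at_k.py` uses, so treat this as an assumption. The function names and toy counts are illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0   # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average per-problem pass@k over a benchmark."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)

# Toy benchmark: four problems, 32 samples each, with these correct counts.
counts = [0, 8, 32, 1]
for k in (1, 8, 32):
    print(k, benchmark_pass_at_k(counts, n=32, k=k))
```

When k equals n (for example 32 samples scored at pass@32), the estimator reduces to "at least one sample is correct".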

### 6. Sec 4–9 primitive analysis

The 9-class span classifier is a fine-tuned encoder model. The training
pipeline ships under `analysis/exploration/llm_validation/classifier/`
and the inference adapter is in
`analysis/exploration/primitive_classification.py`. **Trained classifier
weights are not bundled** — retrain via the pipeline scripts, or skip the
primitive analysis (it is not required to reproduce the headline pass@k
numbers).

```bash
# Once a classifier checkpoint exists at <CLASSIFIER_PATH>:
python -m analysis.exploration.pipeline analysis/exploration/configs/v90_prod_s15_math.yaml
```

This produces the per-trace primitive sequences and motif counts used by
the figure scripts in `scripts/analysis/section{4,6,7}_*.py`.

## NeurIPS checklist evidence

| Item | Evidence in this repo / paper |
|---|---|
| #4 Reproducibility | full training scripts in `train/`, bundled training corpora under `data/`, hyperparameter tables in the paper appendix (`tab:hyper-sft`, `tab:hyper-rl`) |
| #5 Open access | this repo (code + data); only public HF assets used externally (`allenai/OLMo-3-7B-Instruct-SFT`, `RUC-AIBOX/OlymMATH`, etc.) |
| #6 Experimental settings | every Hydra flag is exposed in `train/.../*.sh`; the appendix tables are the canonical reference |
| #8 Compute resources | 4–8× B200 (or H200) 80–144 GB, ≈ 30–50 hr per stage; `INSTALL.md` lists the venv build cost |
| #12 Licenses | see `LICENSES.md` |
| #13 New assets | dataset parquets bundled in `data/` are derived from public puzzle generators (Simon Tatham collection, MIT) — see `LICENSES.md` |

## Caveats

* **Determinism.** vLLM rollouts are not bit-deterministic across
  driver/CUDA/hardware combinations; pass@k numbers should match within
  ±1 pp at the seeds shipped here.
* **Eval token budget.** Always set `--max_new_tokens` to match the training
  `MAX_RESPONSE_LENGTH` (28 000 for the headline runs); a smaller budget
  truncates reasoning and silently zeros pass@k on solvable problems.
* **OlymMATH scorer.** The pre-2026-04-28 `olymp_math/utils.py` substring
  fallback over-credited predictions; the bundled version uses exact-match
  after normalisation. Older numbers from prior reports should not be mixed
  with results produced by this code.
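
To illustrate the OlymMATH scorer caveat, the sketch below contrasts a substring fallback with exact match after normalisation. The `normalise` function is a hypothetical stand-in, not the logic in `olymp_math/utils.py`; it only shows why substring matching can credit wrong answers.

```python
def normalise(ans: str) -> str:
    """Toy normalisation: trim whitespace, math delimiters, and a final period."""
    return ans.strip().strip("$").replace(" ", "").rstrip(".")

def substring_match(pred: str, gold: str) -> bool:
    """Lenient fallback: credits any prediction that contains the gold answer."""
    return normalise(gold) in normalise(pred)

def exact_match(pred: str, gold: str) -> bool:
    """Strict scoring: normalised strings must be identical."""
    return normalise(pred) == normalise(gold)

gold = "12"
for pred in ["$12$", "112", "12/5"]:
    print(pred, substring_match(pred, gold), exact_match(pred, gold))
```

Here `112` and `12/5` both contain `12` and are credited by the substring check, while exact match accepts only `$12$`.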
