Abstract: We present PRISM, a comprehensive empirical study of mid-training design choices for large language
models (LLMs). Through controlled experiments across seven base models spanning four families
(Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba
hybrid), and scales from 3B to 24B parameters, we show that a mid-training phase of ∼27B
high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code,
and +6 to +13 points on science (GPQA-Diamond) benchmarks while preserving general performance.
The full PRISM →RL pipeline improves the macro-average (domain-weighted) across six reasoning
benchmarks from under 12 to 29–42 (a 3–4× improvement), whereas RL applied directly to most of the
base models remains substantially less effective, with AIME scores near zero. Data composition choices
matter most at mid-training, not at RL: including science data during mid-training unlocks +17 to +28
point GPQA-Diamond gains during RL, while changing the RL mix produces <2 point differences.
Mechanistically, mid-training densely restructures >90% of model weights, while RL makes sparse,
front-loaded refinements to ∼5% of parameters. Representation analysis (CKA) across three models
and three input distributions confirms that RL consistently preserves mid-training's representational
geometry (>0.998 CKA) across both dense Transformers and hybrid architectures. Crucially, RL
applies nearly identical weight changes regardless of starting point, yet only succeeds on mid-trained models,
consistent with mid-training placing the model in a weight configuration from which RL can effectively
improve performance. Our results demonstrate that retention-aware mid-training is a highly effective
intermediate step for reliable reasoning enhancement and provide practical guidance for designing
robust mid-training pipelines.