Abstract: We present PRISM, a comprehensive empirical study of mid-training design choices for large language
models (LLMs). Through controlled experiments across seven base models spanning four families
(Granite, LLaMA, Mistral, Nemotron-H), two architecture types (dense Transformer and attention-Mamba
hybrid), and scales from 3B to 24B parameters, we show that a mid-training phase of ∼27B
high-quality tokens yields consistent gains of +15 to +40 points on math, +5 to +12 points on code,
and +6 to +13 points on science (GPQA-Diamond) benchmarks while preserving general performance.
The full PRISM →RL pipeline improves the macro-average (domain-weighted) across six reasoning
benchmarks from under 12 to 29–42 (a 3–4× improvement), whereas RL applied directly to most of the
base models remains substantially less effective, with AIME scores near zero. Data composition choices
matter most at mid-training, not at RL: including science data during mid-training unlocks +17 to +28
point GPQA-Diamond gains during RL, while changing the RL mix produces <2 point differences.
Mechanistically, mid-training densely restructures >90% of model weights, while RL makes sparse,
front-loaded refinements to ∼5% of parameters. Representation analysis (CKA) across three models
and three input distributions confirms that RL consistently preserves mid-training's representational
geometry (>0.998 CKA) across both dense Transformers and hybrid architectures. Crucially, RL
applies nearly identical weight changes regardless of starting point, yet only succeeds on mid-trained models,
consistent with mid-training placing the model in a weight configuration from which RL can effectively
improve performance. Our results demonstrate that retention-aware mid-training is a highly effective
intermediate step for reliable reasoning enhancement and provide practical guidance for designing
robust mid-training pipelines.