Scaling with Collapse: Efficient and Predictable Training of LLM Families

ICLR 2026 Conference Submission 13815 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Training loss curve collapse, Compute-efficient LLM pre-training, Tokens-per-parameter (TPP), AdamW EMA timescale, Learning-rate schedules, Scale-stable dynamics (μP), Early stopping for hyperparameter tuning
TL;DR: We show that loss curves *collapse* across LLM scales when training at fixed TPP and with AdamW timescale set optimally for that TPP, making collapse a marker of compute-efficient training and a tool for tuning, diagnostics, and early stopping.
Abstract: Effective LLM training relies on *consistency*, meaning that key quantities—such as final losses and optimal hyperparameters—scale predictably across model sizes. Qiu et al. (2025) recently showed that this consistency extends beyond scalars: whole training loss curves can *collapse* onto a universal trajectory after a simple normalization. What remains unclear is whether this phenomenon holds for LLM families trained under *practical scaling recipes*, where width, depth, learning rate, batch size, and weight decay are scaled jointly. We show that it does: loss curves collapse across scales precisely when optimization hyperparameters are set optimally for the given data budget, in accordance with recent empirical scaling laws. Collapse thus emerges as a signature of compute-efficient training. We demonstrate two applications at scale: (1) deviation-from-collapse provides a sensitive, early diagnostic of training pathologies, and (2) the predictability of collapsed curves enables early stopping in large-scale hyperparameter tuning. Finally, we train a competitive LLM family, *Celerity*, using these insights, highlighting collapse as an effective tool for developing efficient LLMs.
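To make the collapse idea concrete, below is a minimal, illustrative sketch of overlaying normalized loss curves and measuring deviation-from-collapse as a diagnostic. The paper's exact normalization follows Qiu et al. (2025) and is not specified in this abstract; the normalization used here (fraction of the token budget on the x-axis, loss relative to final loss on the y-axis), the function names `normalize_curve` and `deviation_from_collapse`, and the synthetic curves are all assumptions for illustration, not the authors' method.

```python
# Illustrative sketch only: the paper's normalization follows Qiu et al. (2025)
# and is not given in the abstract. Here we assume a simple normalization of
# each curve by its final loss and by the fraction of the training budget
# consumed. All names below are hypothetical.

import numpy as np


def normalize_curve(steps, losses, total_steps):
    """Map a raw loss curve onto (fraction of budget, loss relative to final loss)."""
    x = np.asarray(steps, dtype=float) / float(total_steps)
    y = np.asarray(losses, dtype=float) / float(losses[-1])
    return x, y


def deviation_from_collapse(curves, grid=np.linspace(0.05, 1.0, 64)):
    """Mean pointwise spread across normalized curves interpolated onto a common grid.

    A small value suggests the runs share a universal trajectory; growth over
    training could serve as an early flag for pathologies or mis-set
    hyperparameters, in the spirit of the diagnostic described in the abstract.
    """
    interp = np.stack([np.interp(grid, x, y) for x, y in curves])
    return float(np.mean(interp.max(axis=0) - interp.min(axis=0)))


if __name__ == "__main__":
    # Synthetic stand-ins for loss curves from two model sizes trained at the
    # same tokens-per-parameter (TPP) budget.
    rng = np.random.default_rng(0)
    t = np.arange(1, 1001)
    small = 3.5 * t ** -0.15 + 0.01 * rng.standard_normal(t.size)
    large = 3.0 * t ** -0.15 + 0.01 * rng.standard_normal(t.size)
    curves = [normalize_curve(t, small, 1000), normalize_curve(t, large, 1000)]
    print(f"deviation from collapse: {deviation_from_collapse(curves):.4f}")
```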
Primary Area: foundation or frontier models, including LLMs
Submission Number: 13815