TL;DR: Loss curves from compute-optimally trained models collapse onto a universal shape, from which we can derive both theoretical insights and practical diagnostics for scaling.
Abstract: Understanding neural network training dynamics at scale is an important open problem. Although realistic model architectures, optimizers, and data interact in complex ways that make predictive theory challenging, we show that compute-optimally trained models exhibit remarkably precise collective regularities. Specifically, loss curves from models of varying sizes collapse onto a single universal curve when training compute and loss are normalized to unity at the end of training. With learning rate decay, discrepancies between normalized curves fall below the noise floor of individual models' loss curves across random seeds, yielding an exceptionally tight collapse we term "supercollapse." We observe supercollapse across learning rate schedules, datasets, and architectures, including transformers trained on next-token prediction. This collapse breaks down when hyperparameters are scaled suboptimally, providing a practical indicator of proper scaling. We explain these phenomena by connecting collapse to the power-law structure of typical neural scaling laws and by analyzing a simple but effective model of SGD noise dynamics. This model accurately captures how learning rate schedules deform loss curves away from power laws while preserving universality, and why learning rate decay suppresses variance to enable supercollapse.
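The normalization behind the collapse is simple to sketch. The snippet below is an illustrative outline, not the released implementation (see the linked repository): the function names `normalize_curve` and `collapse_deviation`, the interpolation grid, and the dispersion metric are all assumptions made for this example.

```python
import numpy as np

def normalize_curve(C, L):
    """Rescale a run so training compute and loss both equal 1 at the end of training."""
    C, L = np.asarray(C, float), np.asarray(L, float)
    return C / C[-1], L / L[-1]

def collapse_deviation(runs, grid=None):
    """Measure the spread of normalized loss curves across model sizes.

    `runs` is a list of (compute, loss) array pairs, one per model size.
    Each normalized curve is interpolated onto a shared normalized-compute
    grid so curves can be compared pointwise; the return value is the mean
    relative standard deviation across curves.
    """
    if grid is None:
        grid = np.logspace(-2, 0, 200)  # normalized compute in [0.01, 1]
    curves = []
    for C, L in runs:
        c, l = normalize_curve(C, L)
        curves.append(np.interp(grid, c, l))
    curves = np.stack(curves)
    return np.mean(curves.std(axis=0) / curves.mean(axis=0))
```

Comparing this cross-size deviation against the seed-to-seed deviation of a single model's loss curve gives a diagnostic in the spirit of the paper: a spread well above the noise floor would suggest hyperparameters are not being scaled properly.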
Lay Summary: We find that the loss curves of neural networks follow nearly identical shapes as model size and training duration are scaled up. This surprising phenomenon provides valuable diagnostic information about neural network training dynamics at scale, and we offer a theoretical explanation of the mechanisms behind it.
Link To Code: https://github.com/shikaiqiu/supercollapse
Primary Area: Deep Learning
Keywords: Scaling Laws, Optimization
Submission Number: 13037