Keywords: Optimization, Scaling Laws
TL;DR: Compute-optimally trained neural networks have highly consistent learning curves across model scales.
Abstract: Studies of scaling ladders have shown that the compute-optimal Pareto frontier of a family of loss curves can have a predictable shape, often a power law. We use a series of small transformer models to demonstrate that the full loss curves themselves have a consistent shape — collapsing onto a single universal curve after an affine rescaling. Surprisingly, the deviations in the rescaled curves across model sizes are smaller than the deviations induced by random initialization and data ordering in the raw loss curves, a phenomenon we call supercollapse. We recreate this phenomenon in a simplified setting of training MLPs on a synthetic regression dataset. By analyzing both the original model and our simplified model, we identify necessary conditions for supercollapse, including compute-optimal training, learning rate decay, and a power-law compute-loss Pareto frontier, and demonstrate its sensitivity to the estimate of the irreducible loss. Our study hints at a broader, dynamical universality induced by compute-optimal scaling procedures.
Submission Number: 78
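The abstract describes collapsing loss curves onto a single universal curve via an affine rescaling after subtracting an estimate of the irreducible loss, and notes that the collapse is sensitive to that estimate. Below is a minimal sketch of one such rescaling and a collapse-tightness measure, assuming a simple normalization by total training duration and final excess loss; the function names and the specific normalization are illustrative, not the paper's exact procedure.

```python
import numpy as np

def rescale_curve(steps, losses, irreducible_loss):
    """Affinely rescale one loss curve: subtract an estimated irreducible loss,
    then normalize time by the total training duration and the excess loss by
    its final value. (Illustrative normalization, not the paper's exact one.)"""
    steps = np.asarray(steps, dtype=float)
    losses = np.asarray(losses, dtype=float)
    excess = losses - irreducible_loss      # reducible part of the loss
    t = steps / steps[-1]                   # normalized training time in [0, 1]
    return t, excess / excess[-1]           # normalized excess loss

def collapse_deviation(curves, irreducible_loss, grid=np.linspace(0.05, 1.0, 200)):
    """Interpolate all rescaled curves onto a common time grid and measure how
    tightly they collapse (mean std across model sizes; smaller = tighter)."""
    rescaled = []
    for steps, losses in curves:
        t, y = rescale_curve(steps, losses, irreducible_loss)
        rescaled.append(np.interp(grid, t, y))
    rescaled = np.stack(rescaled)
    return rescaled.std(axis=0).mean()
```

Here `curves` would be a list of `(steps, losses)` pairs from compute-optimally trained models of different sizes; sweeping `irreducible_loss` and watching `collapse_deviation` change illustrates the sensitivity to the irreducible-loss estimate that the abstract mentions.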