Keywords: scaling laws, compute-optimal pre-training, noisy quadratic model, Chinchilla, batch size, large language models
Abstract: Pre-training scaling laws prescribe the best training decisions under resource constraints. Discovering new laws is a demanding exercise, as each decision requires a separate law. An alternative is to model the scaling dynamics of LLMs directly, and then use those models as surrogates for multiple decisions. Yet most theoretical models of scaling dynamics cannot easily be fit to scaling data. In this paper, we introduce the Noisy Quadratic System (NQS), a fittable relative of these theoretical models that can generate new scaling laws. We also identify key failure modes in the theoretical models and further extend the NQS to correct for them. In our experiments, our best model, fit on small-scale runs, closely predicted the performance of runs near critical points, which the Chinchilla law failed to do. Finally, the NQS is the first practical scaling model to include a variance term, which allows us to model the effect of batch size. Because of this, it may help practitioners configure training under many resource constraints, including not only compute but also time and memory.
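The abstract does not spell out the NQS equations. As context, here is a minimal, illustrative sketch of the classic noisy quadratic model that the NQS builds on, in which the expected SGD loss on a quadratic objective carries a gradient-noise term that shrinks with batch size. The function name, curvature/noise spectra, and hyperparameters below are assumptions for illustration, not the paper's method.

```python
import numpy as np

def nqm_expected_loss(h, sigma2, lr, batch_size, steps, theta2_init):
    """Expected SGD loss trajectory on a noisy quadratic (illustrative).

    Per-dimension second moments follow the standard recursion
        E[theta_i^2] <- (1 - lr * h_i)^2 * E[theta_i^2] + lr^2 * sigma2_i / B,
    and the expected loss is 0.5 * sum(h_i * E[theta_i^2]).
    The lr^2 * sigma2 / B term is the variance floor set by batch size B.
    """
    theta2 = theta2_init.copy()
    losses = np.empty(steps)
    for t in range(steps):
        losses[t] = 0.5 * np.sum(h * theta2)
        theta2 = (1.0 - lr * h) ** 2 * theta2 + lr**2 * sigma2 / batch_size
    return losses

# Hypothetical power-law curvature spectrum and unit gradient noise.
h = 1.0 / np.arange(1, 101)
sigma2 = np.ones_like(h)
for B in (32, 256):
    loss = nqm_expected_loss(h, sigma2, lr=0.5, batch_size=B,
                             steps=2000, theta2_init=np.ones_like(h))
    print(f"B={B}: final expected loss {loss[-1]:.4f}")
```

Under these assumptions, larger batches lower the steady-state loss floor while leaving the curvature-driven decay unchanged, which is the kind of batch-size effect a variance term lets a fitted scaling model capture.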
Primary Area: foundation or frontier models, including LLMs
Submission Number: 20225