A Hitchhiker's Guide to Scaling Law Estimation

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC0 1.0
TL;DR: A guide to fitting scaling laws efficiently, for example by using the whole loss trajectory
Abstract: Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets. This provides an efficient way for practitioners and researchers alike to compare pretraining decisions involving optimizers, datasets, and model architectures. Despite the widespread use of scaling laws to model the dynamics of language model training, there has been little work on understanding how to best estimate and interpret them. We collect (and release) a large-scale dataset containing losses and downstream evaluations for 485 previously published pretrained models. We use these to estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families. We find that fitting scaling laws to intermediate checkpoints of training runs (and not just their final losses) substantially improves accuracy, and that—all else equal—estimates of performance are generally most accurate when derived from other models of similar sizes. However, because there is a significant degree of variability across model seeds, training multiple small models is sometimes more useful than training a single large one. Moreover, while different model families differ in scaling behavior, they are often similar enough that a target model’s behavior can be predicted from a single model with the same architecture, along with scaling parameter estimates derived from other model families.
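To make the fitting procedure concrete, the sketch below fits a Chinchilla-style parametric form L(N, D) = E + A/N^alpha + B/D^beta to (parameter count, training tokens, loss) observations, treating intermediate checkpoints of each run as additional data points rather than using only final losses. The functional form, the synthetic example numbers, and the scipy-based optimization are illustrative assumptions for exposition only; they are not the paper's exact procedure or released code.

```python
# Minimal sketch (not the paper's released code) of fitting a
# Chinchilla-style scaling law
#     L(N, D) = E + A / N**alpha + B / D**beta
# where N is parameter count and D is the number of training tokens.
# Using intermediate checkpoints means a single training run contributes
# several (N, D, loss) points instead of only its final loss.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(X, E, A, alpha, B, beta):
    """Predicted loss for a model with N parameters trained on D tokens."""
    N, D = X
    return E + A / N**alpha + B / D**beta

# Synthetic, illustrative observations: (parameters, tokens, loss).
# Rows that share N are intermediate checkpoints of the same run.
data = np.array([
    (1.2e8, 5e9,  3.65), (1.2e8, 2e10, 3.32), (1.2e8, 5e10, 3.21),
    (4.1e8, 5e9,  3.30), (4.1e8, 2e10, 3.01), (4.1e8, 5e10, 2.90),
    (1.4e9, 5e9,  3.05), (1.4e9, 2e10, 2.78), (1.4e9, 5e10, 2.66),
])
N, D, loss = data[:, 0], data[:, 1], data[:, 2]

# Fit the five free parameters; the initial guess and bounds keep the
# optimizer in a plausible region for language-model losses.
popt, _ = curve_fit(
    scaling_law, (N, D), loss,
    p0=[1.8, 4e2, 0.34, 4e2, 0.28],
    bounds=([0.0, 0.0, 0.0, 0.0, 0.0], [10.0, 1e6, 1.0, 1e6, 1.0]),
)
E, A, alpha, B, beta = popt

# Extrapolate to a larger (hypothetical) target model: 7B parameters, 200B tokens.
print("Predicted loss at N=7e9, D=2e11:", scaling_law((7e9, 2e11), *popt))
```

A usage note: because the fit extrapolates, predictions are most reliable when the target model is not far outside the range of sizes used for estimation, which is consistent with the abstract's finding that estimates are most accurate when derived from models of similar size.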
Lay Summary: Training large language models is expensive and time-consuming, so researchers often try to predict how a model will perform before fully training it. One common technique is using "scaling laws," which are mathematical formulas that estimate how a model’s performance changes as you increase the size of the model or the training data. These predictions help guide decisions like which architecture to use or how much data to collect. In this paper, we take a closer look at how to make these predictions more accurate and useful. We gather a large dataset of nearly 500 language models and analyze how well different scaling law methods work. We discover that using not just the final results of model training but also data from earlier training steps leads to better predictions. We also find that comparing models of similar sizes gives more reliable results, and that training a few small models can sometimes give you more insight than training one big one. Even though different types of models scale differently, we show that it’s often possible to predict how a new model will behave based on similar models and a few smart estimates. Our work offers practical tips for researchers trying to design and train better models more efficiently.
Primary Area: Deep Learning->Large Language Models
Keywords: Scaling laws, recycling, efficiency, LLMs, Large Language Models, NLP, Natural Language Processing
Submission Number: 11149