TL;DR: A guide to fitting scaling laws efficiently, for example by using the whole loss trajectory
Abstract: Scaling laws predict the loss of a target machine learning model by extrapolating
from easier-to-train models with fewer parameters or smaller training sets. This
provides an efficient way for practitioners and researchers alike to compare pre-
training decisions involving optimizers, datasets, and model architectures. Despite
the widespread use of scaling laws to model the dynamics of language model
training, there has been little work on understanding how to best estimate and
interpret them. We collect (and release) a large-scale dataset containing losses and
downstream evaluations for 485 previously published pretrained models. We use
these to estimate more than 1000 scaling laws, then derive a set of best practices
for estimating scaling laws in new model families. We find that fitting scaling laws
to intermediate checkpoints of training runs (and not just their final losses) substan-
tially improves accuracy, and that—all else equal—estimates of performance are
generally most accurate when derived from other models of similar sizes. However,
because there is a significant degree of variability across model seeds, training
multiple small models is sometimes more useful than training a single large one.
Moreover, while different model families differ in scaling behavior, they are often
similar enough that a target model’s behavior can be predicted from a single model
with the same architecture, along with scaling parameter estimates derived from
other model families.
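To make the setup concrete, the sketch below shows one common way to estimate a scaling law of the kind the abstract describes: fitting a Chinchilla-style form L(N, D) = E + A·N^(−α) + B·D^(−β) to losses from several smaller runs, including intermediate checkpoints rather than only final losses, and then extrapolating to a larger target model. This is an illustrative example under assumed functional form and synthetic data, not the paper's code or released dataset; all variable names and numbers are hypothetical.

```python
# Minimal sketch (not the paper's implementation): fit a Chinchilla-style
# scaling law L(N, D) = E + A * N**(-alpha) + B * D**(-beta) to
# (parameter count, tokens seen, loss) triples, using intermediate
# checkpoints as well as final losses. All values below are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, E, A, alpha, B, beta):
    N, D = x  # model parameters, training tokens
    return E + A * N ** (-alpha) + B * D ** (-beta)

# Synthetic data: three small models, three checkpoints each.
N = np.array([1e8, 1e8, 1e8, 4e8, 4e8, 4e8, 1e9, 1e9, 1e9])
D = np.array([1e9, 5e9, 2e10, 1e9, 5e9, 2e10, 1e9, 5e9, 2e10])
rng = np.random.default_rng(0)
loss = scaling_law((N, D), 1.7, 400.0, 0.34, 410.0, 0.28) + rng.normal(0.0, 0.01, N.shape)

# Fit the five scaling parameters; bounds keep the exponents positive.
popt, _ = curve_fit(
    scaling_law, (N, D), loss,
    p0=[2.0, 100.0, 0.3, 100.0, 0.3],
    bounds=([0, 0, 0, 0, 0], [10, 1e4, 1, 1e4, 1]),
    maxfev=20000,
)

# Extrapolate to a hypothetical larger target model.
predicted = scaling_law((7e9, 1.4e11), *popt)
print(f"Predicted loss for 7B params / 140B tokens: {predicted:.3f}")
```

In practice the fit is often done on log-losses or with a robust objective (e.g., a Huber loss) rather than plain least squares, but the extrapolation step is the same: estimate the scaling parameters from cheaper runs, then evaluate the fitted law at the target model's size and token budget.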
Lay Summary: Training large language models is expensive and time-consuming, so researchers often try to predict how a model will perform before fully training it. One common technique is using "scaling laws," which are mathematical formulas that estimate how a model’s performance changes as you increase the size of the model or the training data. These predictions help guide decisions like which architecture to use or how much data to collect.
In this paper, we take a closer look at how to make these predictions more accurate and useful. We gather a large dataset of nearly 500 language models and analyze how well different scaling law methods work. We discover that using not just the final results of model training but also data from earlier training steps leads to better predictions. We also find that comparing models of similar sizes gives more reliable results, and that training a few small models can sometimes give you more insight than training one big one. Even though different types of models scale differently, we show that it’s often possible to predict how a new model will behave based on similar models and a few smart estimates. Our work offers practical tips for researchers trying to design and train better models more efficiently.
Primary Area: Deep Learning->Large Language Models
Keywords: Scaling laws, recycling, efficiency, LLMs, Large Language Models, NLP, Natural Language Processing
Submission Number: 11149