Pretraining Scaling Laws for Generative Evaluations of Language Models

Published: 17 Oct 2025, Last Modified: 21 Nov 2025, MATH-AI 2025 Poster, CC BY 4.0
Keywords: language models, large language models, scaling laws, evaluations, generative evaluations, sampling
TL;DR: Scaling laws for generative evals of language models during pretraining
Abstract: Neural scaling laws have played a central role in modern machine learning, driving the field's ever-expanding scaling of parameters, data, and compute. While much research has gone into fitting scaling laws to predict pretraining losses and performance on \emph{discriminative} evaluations such as multiple-choice question-answering benchmarks, comparatively little research has been done on fitting scaling laws to predict performance on \emph{generative} evaluations such as mathematical problem-solving or coding. In this work, we propose and evaluate three pretraining scaling laws for fitting pass-at-$k$ on generative evaluations and for predicting pass-at-$k$ of the most expensive model. Our three scaling laws differ in the covariates used: (1) pretraining compute, (2) model parameters and pretraining tokens, (3) log likelihoods of gold reference solutions. We make three main contributions: First, we show how generative evaluations offer new hyperparameters (in our setting, $k$) that researchers can use to control the scaling law parameters and the predictability of performance. Second, in terms of scaling law parameters, we find that the compute scaling law and the parameters\,+\,tokens scaling law stabilize over the last $\mathord{\sim}1.5{-}2.5$ orders of magnitude, whereas the gold reference likelihood scaling law stabilizes over the last $\mathord{\sim}5$ orders of magnitude. Third, in terms of predictive performance, we find that all three scaling laws perform comparably, although the compute scaling law predicts slightly worse for small $k$ and the gold reference likelihood scaling law predicts slightly worse for large $k$. Our framework provides researchers and practitioners with insights and methodologies to forecast generative performance, accelerating progress toward models that can reason, solve, and create.
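To make the fitting setup concrete, below is a minimal sketch of how a compute-based pass-at-$k$ scaling law might be fit and extrapolated. The saturating functional form $\text{pass@}k(C) = 1 - \exp\big(-(C/C_0)^{\alpha}\big)$, the synthetic data, and all variable names are illustrative assumptions, not the paper's actual parameterization; the other two laws would swap in parameters + tokens or gold reference log likelihoods as covariates.

```python
# Illustrative sketch (not the paper's actual parameterization): fit a
# compute-based scaling law to pass@k measurements and extrapolate to a
# larger compute budget. Assumes the saturating form
#     pass@k(C) = 1 - exp(-(C / C0)^alpha),
# parameterized in log10(compute) for numerical stability.
import numpy as np
from scipy.optimize import curve_fit

def pass_at_k_law(log10_compute, log10_c0, alpha):
    """Assumed saturating power-law form, bounded in [0, 1]."""
    return 1.0 - np.exp(-(10.0 ** (alpha * (log10_compute - log10_c0))))

# Synthetic pass@k measurements across ~6 orders of magnitude of compute (FLOPs).
log10_compute = np.linspace(18.0, 24.0, 20)
rng = np.random.default_rng(0)
observed_pass_at_k = np.clip(
    pass_at_k_law(log10_compute, 22.0, 0.4)
    + rng.normal(0.0, 0.01, log10_compute.shape),
    0.0, 1.0,
)

# Fit the two free parameters (C0, alpha) by nonlinear least squares.
(fit_log10_c0, fit_alpha), _ = curve_fit(
    pass_at_k_law, log10_compute, observed_pass_at_k, p0=[21.0, 0.5]
)

# Extrapolate to the most expensive (held-out) compute budget.
predicted = pass_at_k_law(25.0, fit_log10_c0, fit_alpha)
print(f"fitted log10(C0)={fit_log10_c0:.2f}, alpha={fit_alpha:.2f}, "
      f"predicted pass@k at 1e25 FLOPs={predicted:.3f}")
```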
Supplementary Material: pdf
Submission Number: 183