Predicting Large Model Test Losses with a Noisy Quadratic System

Chuning Li; Chris J. Maddison

Published: 30 Apr 2026, Last Modified: 24 Jun 2026. Venue: ICML 2026 regular. Readers: Everyone. License: CC BY 4.0.

TL;DR: Predicting large model test loss using model size, batch size, and number of weight updates.

Abstract: We introduce a predictive model that estimates the pre-training loss of large models from model size ($N$), batch size ($B$), and number of weight updates ($K$). This is the first loss prediction model that can handle changing batch size. The model outperforms Chinchilla's loss model, which models the test loss using the model size and number of tokens, at projecting the loss to extrapolated compute budgets (up to 1000-fold). A natural use of the model is to find optimal $N, B, K$ configurations under explicit and compound resource constraints such as time, memory, and compute. In our experiments, the model-selected configurations are close to the ground-truth optimum. Our work advocates for loss prediction as a better alternative to heuristic-based laws, which are growing in complexity. The implementation is available on \href{https://github.com/chuningxdy/Noisy-Quadratic-System}{GitHub}.
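
As a rough illustration of the configuration search described in the abstract, the sketch below does a grid search over $N, B, K$ under a compute cap and a cap on sequential updates. It is not the paper's method: `toy_loss` is a made-up stand-in for a loss predictor (its functional form and constants are assumptions chosen only for illustration), `best_config` is a hypothetical helper, the compute estimate uses the common 6 x parameters x tokens approximation with $B$ measured in tokens, and the step cap is only a crude proxy for a time constraint. Memory is not modeled.

```python
import itertools


def toy_loss(n_params, batch_tokens, n_steps):
    """Toy stand-in for a predictor L(N, B, K); NOT the paper's model.

    Power-law terms in N and in tokens seen (B * K), plus an invented
    term in K so that few-update runs are penalized. All constants are
    illustrative assumptions, not fitted values.
    """
    tokens = batch_tokens * n_steps
    return (1.7 + 400.0 / n_params ** 0.34
            + 410.0 / tokens ** 0.28
            + 50.0 / n_steps ** 0.5)


def best_config(loss_fn, grid_n, grid_b, grid_k, max_flops, max_steps):
    """Return (loss, N, B, K) with the lowest predicted loss among
    feasible grid points, or None if no grid point is feasible."""
    best = None
    for n, b, k in itertools.product(grid_n, grid_b, grid_k):
        flops = 6.0 * n * b * k          # assumed compute estimate
        if flops > max_flops or k > max_steps:
            continue                     # violates a resource constraint
        loss = loss_fn(n, b, k)
        if best is None or loss < best[0]:
            best = (loss, n, b, k)
    return best


if __name__ == "__main__":
    result = best_config(
        toy_loss,
        grid_n=[1e7, 1e8, 1e9],
        grid_b=[2 ** 16, 2 ** 18, 2 ** 20],
        grid_k=[1_000, 10_000, 100_000],
        max_flops=1e19,
        max_steps=10_000,
    )
    print(result)
```

Swapping `toy_loss` for any callable with the same three arguments, such as a fitted loss predictor, leaves the search unchanged. A finer grid or a continuous optimizer would be a natural refinement.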

Lay Summary: Training large AI models is expensive, so deciding how big a model should be, how much data it should process at once, and how long it should train can involve a lot of costly trial and error. Existing scaling laws such as Chinchilla offer helpful guidance, but are less flexible in accounting for practical training choices and can be less reliable when predicting much larger training runs. This paper introduces a predictive model that estimates training performance directly from a few key design choices, which helps researchers plan efficient training runs under real-world constraints like compute, memory, and time. This approach could make AI development less wasteful.

Link To Code: https://github.com/chuningxdy/Noisy-Quadratic-System

Primary Area: Deep Learning->Large Language Models

Keywords: test loss, scaling laws, pre-training, noisy quadratic model, Chinchilla, batch size, large language models

Submission Number: 18024