Keywords: LLM, OOD, Scaling Laws, Performance Prediction, Reliable Evaluation
Abstract: As large language models (LLMs) continue to scale to billions of parameters, training them becomes increasingly expensive, making it infeasible to exhaustively explore the vast design space of model architectures, parameter sizes, and compute budgets. Scaling laws have therefore emerged as an essential tool for predicting the performance of larger models by extrapolating from smaller ones, enabling practitioners to make informed design choices without full-scale training. However, existing approaches lack formal guarantees on the predicted results and overlook the out-of-distribution nature of such extrapolation, leading to high instability. We address these challenges with three key contributions. First, we introduce Equivalent Sample Size (ESS), a natural and principled metric that quantifies prediction uncertainty by translating it into the number of test samples required for direct, in-distribution evaluation. Second, we analyze how extrapolation amplifies prediction variance and develop an efficient algorithm that optimally allocates smaller-model evaluations to maximize ESS under a given compute budget. Third, experiments on both simulated and real datasets show that ESS and our algorithm guide the design of scaling-law learning, cut evaluation cost, and deliver reliable LLM performance predictions.
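The abstract describes ESS as translating prediction uncertainty into an equivalent number of in-distribution test samples. The sketch below illustrates one plausible reading of that idea under an assumed binomial variance model for an accuracy-type metric; the paper's exact definition of ESS and its allocation algorithm are not given here and may differ.

```python
# Hypothetical sketch of the Equivalent Sample Size (ESS) idea, assuming an
# accuracy-type metric whose direct evaluation on n i.i.d. test samples has
# variance p(1 - p) / n. The paper's actual definition may differ.

def equivalent_sample_size(predicted_accuracy: float, prediction_variance: float) -> float:
    """Return the number of i.i.d. test samples whose direct evaluation
    would carry the same variance as the scaling-law prediction.

    Matching p(1 - p) / n to the prediction variance gives
    ESS = p(1 - p) / Var(prediction).
    """
    if prediction_variance <= 0:
        raise ValueError("prediction_variance must be positive")
    p = predicted_accuracy
    return p * (1.0 - p) / prediction_variance


# Example: a predicted accuracy of 70% with a standard deviation of 2 points
# is roughly as informative as evaluating 525 in-distribution test samples.
print(round(equivalent_sample_size(0.70, 0.02 ** 2)))  # -> 525
```

Under this reading, extrapolation that inflates prediction variance directly shrinks ESS, which is why allocating smaller-model evaluations to reduce that variance increases the effective evidence per unit of compute.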
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10071