Keywords: Large Language Models, Downstream Metrics, Pretraining, Evaluation, Benchmarks, LLM
Abstract: While scaling laws for Large Language Models (LLMs) traditionally focus on proxy metrics like pretraining loss, predicting downstream task performance has been considered unreliable. This paper challenges that view by proposing a direct framework to model the scaling of downstream accuracy from the training budget. We demonstrate that for a fixed token-to-parameter ratio, a simple two-parameter scaling law accurately describes this relationship. Our findings are validated by experiments on models with up to 17B parameters trained on up to 350B tokens, showing that the scaling of downstream capabilities can be described by a scaling law. Furthermore, we extend this framework to extrapolate from a set of smaller experiments and predict the accuracy of a target model trained with an up to 6.7x larger training budget. We will release a complete list of model losses and downstream evaluation results at various scales to support reproducibility and encourage future research.
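To make the idea concrete, here is a minimal sketch of fitting a two-parameter scaling law of downstream accuracy against training budget and extrapolating to a larger budget. The saturating power-law form `1 - a * budget^(-b)` and all numerical values are illustrative assumptions; the abstract does not specify the paper's functional form or data.

```python
# A minimal sketch (not the paper's code): fit a two-parameter scaling law of
# downstream accuracy against training budget and extrapolate to a larger
# budget. The functional form and all numbers below are illustrative
# assumptions, not results from the paper.
import numpy as np
from scipy.optimize import curve_fit

def accuracy_law(budget, a, b):
    # Two free parameters (a, b): accuracy rises toward 1 as the budget grows.
    return 1.0 - a * np.power(budget, -b)

# Placeholder measurements from small-scale runs; budgets are expressed
# relative to the smallest run to keep the fit numerically well behaved.
budgets = np.array([1.0, 3.0, 10.0, 30.0])       # relative training budget
accuracies = np.array([0.42, 0.48, 0.55, 0.61])  # downstream accuracy

(a_hat, b_hat), _ = curve_fit(accuracy_law, budgets, accuracies, p0=[0.5, 0.1])

# Extrapolate to a target budget ~6.7x larger than the largest fitted run.
target = 6.7 * budgets[-1]
print(f"Predicted accuracy at {target:.1f}x relative budget: "
      f"{accuracy_law(target, a_hat, b_hat):.3f}")
```

In practice, the small-scale runs would share the fixed token-to-parameter ratio described in the abstract, so that the training budget is the single variable driving the fit.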
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21172