LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

Published: 01 May 2025, Last Modified: 23 Jul 2025, ICML 2025 poster, CC BY 4.0
TL;DR: Pretraining data distribution, rather than model architecture or other factors, determines the loss-to-loss scaling behavior of large language models.
Abstract: Scaling laws guide the development of large language models (LLMs) by offering estimates for the optimal balance of model size, tokens, and compute. More recently, loss-to-loss scaling laws that relate losses across pretraining datasets and downstream tasks have emerged as a powerful tool for understanding and improving LLM performance and generalization. In this work, we investigate which factors most strongly influence loss-to-loss scaling. Our experiments reveal that the pretraining data determines the scaling trend. In contrast, model size, optimization hyperparameters, tokenizer and even significant architectural differences, such as between transformer-based models like Llama and state-space models like Mamba, generally have limited impact. Consequently, practitioners should carefully curate pretraining datasets for optimal downstream performance, while architectures and other settings can be freely optimized for training efficiency.
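Illustrative sketch (not the paper's exact formulation): loss-to-loss scaling relations are commonly modeled as power-law fits between a pretraining loss and a downstream loss. The snippet below fits such a curve on synthetic data; the shifted power-law form, variable names, and numbers are assumptions for illustration only.

```python
# Hedged sketch: fit a hypothetical loss-to-loss relation between
# pretraining loss (L_train) and downstream loss (L_down).
# The shifted power-law form is an illustrative assumption, not
# necessarily the parameterization used in the paper.
import numpy as np
from scipy.optimize import curve_fit

def shifted_power_law(l_train, k, kappa, e_down):
    # L_down ~= k * L_train^kappa + e_down (irreducible downstream loss)
    return k * l_train**kappa + e_down

# Synthetic example data: pretraining losses and matched downstream losses
l_train = np.array([3.2, 2.9, 2.7, 2.5, 2.35, 2.2])
l_down  = np.array([4.1, 3.7, 3.45, 3.25, 3.1, 2.95])

params, _ = curve_fit(shifted_power_law, l_train, l_down, p0=[1.0, 1.0, 1.0])
k, kappa, e_down = params
print(f"fit: L_down ~= {k:.3f} * L_train^{kappa:.3f} + {e_down:.3f}")

# Predict the downstream loss for a new pretraining loss
print("predicted L_down at L_train=2.0:", shifted_power_law(2.0, *params))
```

Under the paper's claim, curves fitted this way would shift when the pretraining dataset changes but stay largely overlapping across architectures, tokenizers, and other training choices.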
Lay Summary: Our work indicates that if two models with different training setups (architecture, context length, tokenizer, etc.) are trained on the same data and achieve similar training losses, they will exhibit closely matched downstream test performance.
Link To Code: https://github.com/brendel-group/llm-line
Primary Area: Deep Learning->Large Language Models
Keywords: LLMs, scaling laws, data-centric ML, generalization
Submission Number: 12103