Understanding Scaling Laws With Token-Level Analysis

Published: 02 Mar 2026 · Last Modified: 02 Mar 2026 · Sci4DL 2026 · CC BY 4.0
Keywords: Scaling Laws
TL;DR: We attempt to understand power-law scaling behaviour during pretraining by studying loss trajectories of individual tokens
Abstract: Large language models (LLMs) exhibit remarkably smooth power-law scaling of cross-entropy loss with respect to training data size (tokens). Despite the robustness of this empirical law, its mechanistic origins remain unclear: why do the dynamics of gradient-based training yield such a clean functional form? This paper approaches the question through a token-level lens. We pretrain OLMo-2 at multiple token budgets and record per-token validation losses across checkpoints. We find (i) individual token-loss trajectories are highly heterogeneous and often noisy, with no obvious shared parametric form; (ii) nevertheless, best-fit power laws typically imply decreasing trends, and a substantial fraction of tokens are well-approximated by power laws; and (iii) the canonical aggregate power law emerges only after averaging over a critical mass of token-level losses, with fit error decaying rapidly as coverage increases. Our findings show that power-law scaling is not a universal property of each token's learning curve, but an emergent phenomenon arising from averaging many heterogeneous token-wise dynamics.
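To make the described analysis concrete, the sketch below (not the authors' code; array shapes, the random placeholder data, and the three-parameter form L(t) = c + a·t^(-b) are illustrative assumptions) fits a power law to each token's validation-loss trajectory and then to the average over progressively larger token subsets, mirroring the comparison between per-token and aggregate fit error.

```python
# Minimal sketch (illustrative, not the authors' pipeline): fit a power law
#   L(t) = c + a * t**(-b)
# to per-token validation-loss trajectories and to their average.
# Assumes `losses` has shape (num_tokens, num_checkpoints) and `tokens_seen`
# holds the number of training tokens seen at each checkpoint.
import numpy as np
from scipy.optimize import curve_fit

def power_law(t, a, b, c):
    return c + a * np.power(t, -b)

def fit_rmse(t, y):
    """Fit the power law to one loss trajectory and return its RMSE."""
    try:
        popt, _ = curve_fit(power_law, t, y,
                            p0=(max(y[0] - y[-1], 1e-3), 0.5, y[-1]),
                            maxfev=10_000)
    except RuntimeError:
        return np.inf  # fit failed to converge
    return float(np.sqrt(np.mean((power_law(t, *popt) - y) ** 2)))

rng = np.random.default_rng(0)
tokens_seen = np.logspace(8, 10, 20)            # checkpoint token counts
losses = rng.uniform(0.5, 8.0, (1000, 20))      # placeholder per-token losses

# Per-token fits: heterogeneous, often noisy trajectories.
per_token_err = np.array([fit_rmse(tokens_seen, y) for y in losses])

# Aggregate fits: average over increasingly large random token subsets.
for k in (10, 100, 1000):
    subset = losses[rng.choice(len(losses), size=k, replace=False)]
    print(k, fit_rmse(tokens_seen, subset.mean(axis=0)))
```

With real per-token losses in place of the random placeholders, the aggregate fit error would be expected to shrink as the subset size grows, which is the emergence effect the abstract refers to.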
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 29