On the Scaling Theory of Multi-Layer Transformers

Chiwun Yang

08 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: learning dynamics of transformer, theoretical analysis of neural scaling law, neural tangent kernel

Abstract: The scaling law, a cornerstone of Large Language Model (LLM) development, predicts that model performance improves as computational resources increase. Although it is empirically validated, its theoretical underpinnings remain poorly understood. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system, then approximates this process by kernel behavior. Departing from prior toy-model analyses, we rigorously analyze one-pass stochastic gradient descent (SGD) training of multi-layer transformers on sequence-to-sequence data with an arbitrary data distribution, closely mirroring real-world conditions. Our analysis characterizes the convergence of the generalization error to the irreducible risk as computational resources scale with data. We derive an excess risk of $\Theta(\mathsf{C}^{-1/8})$ for computational cost $\mathsf{C}$. The theory reveals a phase transition: under specific conditions, the upper bound on the generalization risk drops sharply to $\exp(-\mathsf{C}^{1/4})$ before reverting to its original decay rate. This transition delineates three scaling regimes (*classical, over-parameterization, and data-limited*), which we analyze for their impact on scaling efficiency and the emergence of grokking.
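
As a rough illustration of what the stated rate implies, suppose the excess risk follows the power law exactly, with a hypothetical constant $a > 0$ standing in for the factors hidden by $\Theta(\cdot)$ (the abstract does not give this constant, and $\mathcal{R}$, $\mathcal{R}^{*}$, and $k$ are notation assumed here for the risk, the irreducible risk, and a target reduction factor). Then reducing the excess risk by a factor $k$ would require scaling compute by $k^{8}$:

$$
\mathcal{R}(\mathsf{C}) - \mathcal{R}^{*} = a\,\mathsf{C}^{-1/8}
\quad\Longrightarrow\quad
\frac{\mathcal{R}(\mathsf{C}') - \mathcal{R}^{*}}{\mathcal{R}(\mathsf{C}) - \mathcal{R}^{*}} = \left(\frac{\mathsf{C}'}{\mathsf{C}}\right)^{-1/8} = \frac{1}{k}
\quad\Longleftrightarrow\quad
\frac{\mathsf{C}'}{\mathsf{C}} = k^{8}.
$$

Under this simplification, halving the excess risk ($k = 2$) would take $2^{8} = 256$ times the compute. This holds only outside the phase-transition window described above, where the upper bound instead behaves like $\exp(-\mathsf{C}^{1/4})$.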

Supplementary Material: zip

Primary Area: learning theory

Submission Number: 2972
