On the Generalization Ability of Next-Token-Prediction Pretraining

Published: 01 May 2025, Last Modified: 18 Jun 2025, ICML 2025 poster, CC BY 4.0
Abstract: Large language models (LLMs) have demonstrated remarkable potential in handling natural language processing (NLP) tasks and beyond. LLMs can usually be categorized as transformer decoder-only models (DOMs), which use Next-Token-Prediction (NTP) as their pre-training methodology. Despite their tremendous empirical success, a theoretical understanding of how NTP pre-training affects a model's generalization behavior is lacking. To fill this gap, we establish a fine-grained generalization analysis for NTP pre-training based on Rademacher complexity, in which the dependence between tokens is also addressed. Technically, a novel decomposition of Rademacher complexity is developed to study DOMs from the perspectives of the representation learner and the token predictor, respectively. Furthermore, upper bounds on the covering number are established for multi-layer, multi-head transformer-decoder models under the Frobenius norm, which theoretically pioneers the incorporation of the mask matrix within the self-attention mechanism. Our results reveal that the generalization ability of NTP pre-training is quantitatively affected by the number of token sequences $N$, the maximum sequence length $m$, and the number of parameters $\Theta$ in the transformer model. Additionally, experiments on public datasets verify our theoretical findings.
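To make the analyzed setting concrete, the sketch below illustrates the two components the abstract refers to: a causally masked self-attention layer (the representation learner) followed by an output projection (the token predictor), trained with the next-token-prediction loss. This is a minimal toy illustration under assumed dimensions (`vocab_size`, `d_model`, `m`), not the authors' construction or code.

```python
# Minimal illustrative sketch (not the paper's implementation): NTP loss for a
# single-layer, single-head decoder with a causal mask in self-attention.
# All sizes and variable names here are assumptions chosen for illustration.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, d_model, m = 100, 32, 8          # toy vocabulary, width, sequence length m
E = torch.randn(vocab_size, d_model) * 0.02  # token embeddings
Wq = torch.randn(d_model, d_model) * 0.02    # query projection
Wk = torch.randn(d_model, d_model) * 0.02    # key projection
Wv = torch.randn(d_model, d_model) * 0.02    # value projection
Wo = torch.randn(d_model, vocab_size) * 0.02 # token predictor (output projection)

tokens = torch.randint(0, vocab_size, (m,))  # one token sequence of length m
X = E[tokens]                                # (m, d_model)

# Causal (lower-triangular) mask: position t may only attend to positions <= t.
mask = torch.tril(torch.ones(m, m)).bool()
scores = (X @ Wq) @ (X @ Wk).T / d_model ** 0.5
scores = scores.masked_fill(~mask, float("-inf"))
H = torch.softmax(scores, dim=-1) @ (X @ Wv)  # masked self-attention output

# Next-token prediction: logits at position t are scored against token t+1.
logits = H @ Wo                               # (m, vocab_size)
ntp_loss = F.cross_entropy(logits[:-1], tokens[1:])
print(f"toy NTP loss: {ntp_loss.item():.4f}")
```

In the paper's notation, generalization is studied over $N$ such sequences of length at most $m$, with the causal mask entering the covering-number bounds through the self-attention mechanism sketched above.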
Lay Summary: Large language models (LLMs) like ChatGPT excel by predicting the next word, but we lack a theoretical understanding of why this simple training method gives them such powerful generalization abilities. This knowledge gap prevents us from fundamentally grasping how these models work or reliably improving them. We developed a novel mathematical framework to quantify how three key factors, namely training data volume ($N$), text sequence length ($m$), and model size ($\Theta$), collectively shape generalization. Our approach analyzes the model's learning process while accounting for the complex dependencies between words. We validated this theory through experiments on real-world language datasets. This work mathematically explains how next-word prediction training enables generalization in LLMs, effectively decoding their "learning mechanism." These insights allow developers to build more efficient, reliable models with less trial-and-error, and lay the groundwork for safer, more interpretable AI systems in the future.
Primary Area: Theory->Learning Theory
Keywords: Next Token Prediction; Decoder Only Models; Generalization Bounds
Submission Number: 2922