Learning In-context $n$-grams with Transformers: Sub-$n$-grams Are Near-Stationary Points

Published: 01 May 2025. Last Modified: 15 Aug 2025. ICML 2025 poster. License: CC BY 4.0.
TL;DR: For the next-token population cross-entropy loss over in-context $n$-grams, $k$-grams (for $k \leq n$) are near-stationary points.
Abstract: Motivated by empirical observations of prolonged plateaus and stage-wise progression during training, we investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context $n$-gram language models under cross-entropy loss, and establish a sufficient condition for parameter configurations to be stationary points. We then construct a set of parameter configurations for a simplified transformer model that represent $k$-gram estimators (for $k \leq n$), and show that the gradient of the population loss at these solutions vanishes in the limit of infinite sequence length and parameter norm. This reveals a key property of the loss landscape: sub-$n$-grams are near-stationary points of the population cross-entropy loss, offering theoretical insight into widely observed phenomena such as stage-wise learning dynamics and emergent phase transitions. These insights are further supported by numerical experiments that illustrate the learning dynamics of $n$-grams, characterized by discrete transitions between near-stationary solutions.
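To make the abstract's terminology concrete, one standard way to write the objects it refers to is sketched below; the notation is illustrative and may differ from the paper's exact setup. For a sequence $x_{1:T}$ drawn from an $n$-gram (order-$(n-1)$ Markov) source, the population next-token cross-entropy loss of a model $p_\theta$ is
$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x_{1:T+1}}\big[\log p_\theta(x_{T+1} \mid x_{1:T})\big],$$
and an in-context $k$-gram estimator predicts the next token from conditional frequency counts of the last $k-1$ tokens within the observed context:
$$\hat{p}_k(a \mid x_{1:T}) = \frac{\#\{t \leq T-k+1 : x_{t:t+k-2} = x_{T-k+2:T},\ x_{t+k-1} = a\}}{\#\{t \leq T-k+1 : x_{t:t+k-2} = x_{T-k+2:T}\}}.$$
In these (assumed) terms, the abstract's claim is that transformer parameter configurations implementing $\hat{p}_k$ for $k \leq n$ have a population-loss gradient that vanishes in the limit of infinite sequence length and parameter norm.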
Lay Summary: When training AI language models, the learning happens in clear steps, much like levels in a game: at each level, the model picks up new skills. To investigate this, we analyze the training dynamics of a simplified transformer model applied to a basic yet mathematically well-characterized $n$-gram language model. We demonstrate the existence of non-trivial partial solutions where the gradient vanishes, inhibiting further training progress and thereby producing the observed step-wise learning behavior.
Link To Code: https://github.com/tml-epfl/sub-n-grams-are-stationary
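As a quick illustration of the in-context $k$-gram estimators discussed above, here is a minimal sketch, independent of and not taken from the linked repository; the function `kgram_predict`, the toy bigram source, and all constants are assumptions chosen for demonstration.

```python
# Minimal sketch (not from the linked repository): an in-context k-gram
# estimator of the kind described in the abstract. Given a sequence from
# some n-gram source, it predicts the next token from conditional
# frequency counts of the last (k-1) tokens within the context itself.
import numpy as np

def kgram_predict(seq, k, vocab_size, alpha=1e-6):
    """Return a next-token distribution from in-context (k-1)-order counts.

    `alpha` is a small smoothing constant so the estimate stays well
    defined even if the current (k-1)-token context never appeared before.
    """
    context = tuple(seq[len(seq) - (k - 1):]) if k > 1 else tuple()
    counts = np.full(vocab_size, alpha)
    # Slide a window of length k over the sequence and collect the tokens
    # that followed occurrences of the current context.
    for t in range(len(seq) - (k - 1)):
        if tuple(seq[t:t + k - 1]) == context:
            counts[seq[t + k - 1]] += 1.0
    return counts / counts.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy order-1 Markov (bigram) source with a random transition matrix.
    V = 3
    P = rng.dirichlet(np.ones(V), size=V)
    seq = [0]
    for _ in range(500):
        seq.append(int(rng.choice(V, p=P[seq[-1]])))
    # The in-context 2-gram estimate should track the true transition row,
    # while the 1-gram estimate only recovers marginal token frequencies.
    print("true  :", np.round(P[seq[-1]], 3))
    print("2-gram:", np.round(kgram_predict(seq, k=2, vocab_size=V), 3))
    print("1-gram:", np.round(kgram_predict(seq, k=1, vocab_size=V), 3))
```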
Primary Area: Deep Learning->Theory
Keywords: in-context learning, Markov chains, transformers, n-grams
Submission Number: 6251