Learning In-context $n$-grams with Transformers: Sub-$n$-grams Are Near-Stationary Points

Published: 01 May 2025. Last Modified: 15 Aug 2025. ICML 2025 poster. License: CC BY 4.0.
TL;DR: For the next-token population cross-entropy loss over in-context $n$-grams, $k$-grams (for $k \leq n$) are near-stationary points.
Abstract: Motivated by empirical observations of prolonged plateaus and stage-wise progression during training, we investigate the loss landscape of transformer models trained on in-context next-token prediction tasks. In particular, we focus on learning in-context $n$-gram language models under cross-entropy loss, and establish a sufficient condition for parameter configurations to be stationary points. We then construct a set of parameter configurations for a simplified transformer model that represent $k$-gram estimators (for $k \leq n$), and show that the gradient of the population loss at these solutions vanishes in the limit of infinite sequence length and parameter norm. This reveals a key property of the loss landscape: sub-$n$-grams are near-stationary points of the population cross-entropy loss, offering theoretical insight into widely observed phenomena such as stage-wise learning dynamics and emergent phase transitions. These insights are further supported by numerical experiments that illustrate the learning dynamics of $n$-grams, characterized by discrete transitions between near-stationary solutions.
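To make the abstract's terminology concrete, one standard way to write the objects it refers to is sketched below; the notation is illustrative and may differ from the paper's exact setup. For a sequence $x_{1:T}$ drawn from an $n$-gram (order-$(n-1)$ Markov) source, the population next-token cross-entropy loss of a model $p_\theta$ is
$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x_{1:T+1}}\big[\log p_\theta(x_{T+1} \mid x_{1:T})\big],$$
and an in-context $k$-gram estimator predicts the next token from conditional frequency counts of the last $k-1$ tokens within the observed context:
$$\hat{p}_k(a \mid x_{1:T}) = \frac{\#\{t \leq T-k+1 : x_{t:t+k-2} = x_{T-k+2:T},\ x_{t+k-1} = a\}}{\#\{t \leq T-k+1 : x_{t:t+k-2} = x_{T-k+2:T}\}}.$$
In these (assumed) terms, the abstract's claim is that transformer parameter configurations implementing $\hat{p}_k$ for $k \leq n$ have a population-loss gradient that vanishes in the limit of infinite sequence length and parameter norm.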
Lay Summary: When training AI language models, the learning happens in clear steps, much like levels in a game: at each level, the model picks up new skills. To investigate this, we analyze the training dynamics of a simplified transformer model applied to a basic yet mathematically well-characterized $n$-gram language model. We demonstrate the existence of non-trivial partial solutions where the gradient vanishes, inhibiting further training progress and thereby producing the observed step-wise learning behavior.
Link To Code: https://github.com/tml-epfl/sub-n-grams-are-stationary
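As a quick illustration of the in-context $k$-gram estimators discussed above, here is a minimal sketch, independent of and not taken from the linked repository; the function `kgram_predict`, the toy bigram source, and all constants are assumptions chosen for demonstration.

```python
# Minimal sketch (not from the linked repository): an in-context k-gram
# estimator of the kind described in the abstract. Given a sequence from
# some n-gram source, it predicts the next token from conditional
# frequency counts of the last (k-1) tokens within the context itself.
import numpy as np

def kgram_predict(seq, k, vocab_size, alpha=1e-6):
    """Return a next-token distribution from in-context (k-1)-order counts.

    `alpha` is a small smoothing constant so the estimate stays well
    defined even if the current (k-1)-token context never appeared before.
    """
    context = tuple(seq[len(seq) - (k - 1):]) if k > 1 else tuple()
    counts = np.full(vocab_size, alpha)
    # Slide a window of length k over the sequence and collect the tokens
    # that followed occurrences of the current context.
    for t in range(len(seq) - (k - 1)):
        if tuple(seq[t:t + k - 1]) == context:
            counts[seq[t + k - 1]] += 1.0
    return counts / counts.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy order-1 Markov (bigram) source with a random transition matrix.
    V = 3
    P = rng.dirichlet(np.ones(V), size=V)
    seq = [0]
    for _ in range(500):
        seq.append(int(rng.choice(V, p=P[seq[-1]])))
    # The in-context 2-gram estimate should track the true transition row,
    # while the 1-gram estimate only recovers marginal token frequencies.
    print("true  :", np.round(P[seq[-1]], 3))
    print("2-gram:", np.round(kgram_predict(seq, k=2, vocab_size=V), 3))
    print("1-gram:", np.round(kgram_predict(seq, k=1, vocab_size=V), 3))
```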
Primary Area: Deep Learning->Theory
Keywords: in-context learning, Markov chains, transformers, n-grams
Submission Number: 6251