A multiscale analysis of mean-field transformers in the moderate interaction regime

Published: 18 Sept 2025, Last Modified: 29 Oct 2025NeurIPS 2025 oralEveryoneRevisionsBibTeXCC BY 4.0
Keywords: mean-field limits, moderate interaction, mean-field transformers, self-attention models, clustering, multiscale
Abstract: In this paper, we study the evolution of tokens through the depth of encoder-only transformer models at inference time by modeling them as a system of particles interacting in a mean-field way and studying the corresponding dynamics. More specifically, we consider this problem in the moderate interaction regime, where the number $N$ of tokens is large and the inverse temperature parameter $\beta$ of the model scales together with $N$. In this regime, the dynamics of the system displays a multiscale behavior: a fast phase, where the token empirical measure collapses on a low-dimensional space, an intermediate phase, where the measure further collapses into clusters, and a slow one, where such clusters sequentially merge into a single one. We provide a rigorous characterization of the limiting dynamics in each of these phases and prove convergence in the above mentioned limit, exemplifying our results with some simulations.
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 5358
Loading