Understanding Softmax Attention Layers: Exact Mean-Field Analysis on a Toy Problem

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: softmax self-attention; theory of transformers; statistical physics; optimization dynamics
TL;DR: We prove that a softmax self-attention layer trained via gradient descent can solve the so-called single-location regression problem
Abstract: Self-attention has emerged as a fundamental component driving the success of modern transformer architectures, which power large language models and various applications. However, a theoretical understanding of how such models actually work is still under active development. The recent work of Marion et al. (2025) introduced the so-called "single-location regression" problem, which can provably be solved by a simplified self-attention layer but not by linear models, thereby demonstrating a striking functional separation. A rigorous analysis of self-attention with softmax for this problem is challenging due to the coupled nature of the model. In the present work, we use ideas from the classical random energy model in statistical physics to analyze softmax self-attention on the single-location problem. Our analysis yields exact analytic expressions for the population risk in terms of the overlaps between the learned model parameters and those of an oracle. Moreover, we derive a detailed description of the gradient descent dynamics for these overlaps and prove that, under broad conditions, the dynamics converge to the unique oracle attractor. Our work not only advances our understanding of self-attention but also provides key theoretical ideas that are likely to find use in further analyses of even more complex transformer architectures.
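To make the setting concrete, below is a minimal sketch (not the authors' code) of a rank-one softmax attention head trained by plain gradient descent on a stylized single-location regression task, tracking the overlaps of the learned "where" and "what" directions with oracle directions. The data model (a mean shift marking the relevant token), the specific dimensions, and the finite-difference gradients are assumptions made for illustration; they are not the parameterization or analysis of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Toy single-location regression data (stylized stand-in; the exact
# --- data model of Marion et al. (2025) may differ in its details).
d, L = 32, 16                              # token dimension, sequence length
k_star = np.zeros(d); k_star[0] = 1.0      # oracle "where" direction (assumed)
v_star = np.zeros(d); v_star[1] = 1.0      # oracle "what" direction (assumed)

def sample_batch(n):
    X = rng.standard_normal((n, L, d))
    J = rng.integers(0, L, size=n)                  # hidden signal location
    X[np.arange(n), J] += 3.0 * k_star              # mark the relevant token
    y = X[np.arange(n), J] @ v_star                 # label depends only on it
    return X, y

# --- Simplified softmax self-attention head: a single key/query vector k
# --- and a value direction v; the prediction is an attention-weighted value.
def predict(X, k, v):
    scores = X @ k                                   # (n, L) attention logits
    a = np.exp(scores - scores.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)                # softmax over tokens
    return np.einsum('nl,nld,d->n', a, X, v)

def loss_and_grads(X, y, k, v, eps=1e-4):
    # Finite-difference gradients keep the sketch short; autograd would be
    # the natural choice in practice.
    def loss(k_, v_): return np.mean((predict(X, k_, v_) - y) ** 2)
    base = loss(k, v)
    gk = np.array([(loss(k + eps * e, v) - base) / eps for e in np.eye(d)])
    gv = np.array([(loss(k, v + eps * e) - base) / eps for e in np.eye(d)])
    return base, gk, gv

# --- Plain gradient descent; monitor overlaps with the oracle directions.
k = rng.standard_normal(d) / np.sqrt(d)
v = rng.standard_normal(d) / np.sqrt(d)
lr = 0.05
for t in range(500):
    X, y = sample_batch(256)
    risk, gk, gv = loss_and_grads(X, y, k, v)
    k -= lr * gk
    v -= lr * gv
    if t % 100 == 0:
        print(f"step {t:4d}  risk {risk:.3f}  "
              f"overlap(k,k*) {k @ k_star / np.linalg.norm(k):+.2f}  "
              f"overlap(v,v*) {v @ v_star / np.linalg.norm(v):+.2f}")
```

In this sketch the two overlaps printed at each checkpoint play the role of the order parameters in the paper's mean-field description: as they approach 1, the attention head locates the signal-carrying token and reads off its label.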
Primary Area: Theory (e.g., control theory, learning theory, algorithmic game theory)
Submission Number: 26506