Keywords: Attention mechanism, Interacting particle systems, Minimax rates, Nonparametric estimation
TL;DR: Modeling attention mechanisms as interacting particle systems, we prove that learning pairwise token interactions achieves the minimax rate $M^{-\frac{2\beta}{2\beta+1}}$ when the sample size $M$ is large enough.
Abstract: We study the convergence rate of learning pairwise interactions in single-layer attention-style models, where tokens interact through a weight matrix and a nonlinear activation function. We prove that the minimax rate is $M^{-\frac{2\beta}{2\beta+1}}$, where $M$ is the sample size and $\beta$ is the H\"older smoothness of the activation function. Importantly, this rate is independent of the embedding dimension $d$, the number of tokens $N$, and the rank $r$ of the weight matrix, provided that $rd \le (M/\log M)^{\frac{1}{2\beta+1}}$. These results highlight a fundamental statistical efficiency of attention-style models, even when the weight matrix and activation are not separately identifiable, and they deepen the theoretical understanding of attention mechanisms while offering guidance for training.
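To make the stated rate and the rank condition concrete, here is a minimal numerical sketch. The function names and example values ($M$, $r$, $d$, $\beta$) are illustrative choices, not part of the paper; the formulas are taken directly from the abstract.

```python
import math

def minimax_rate(M: int, beta: float) -> float:
    """The minimax rate M^{-2*beta/(2*beta+1)} from the abstract."""
    return M ** (-2 * beta / (2 * beta + 1))

def rank_condition_holds(M: int, r: int, d: int, beta: float) -> bool:
    """Check the dimension-free condition r*d <= (M / log M)^{1/(2*beta+1)}."""
    return r * d <= (M / math.log(M)) ** (1 / (2 * beta + 1))

# Illustrative values (hypothetical): smoother activations (larger beta)
# yield a rate closer to the parametric 1/M.
M = 10_000
print(minimax_rate(M, beta=1.0))              # M^{-2/3}
print(minimax_rate(M, beta=2.0))              # M^{-4/5}, faster
print(rank_condition_holds(M, r=2, d=4, beta=1.0))
```

Note how the rate depends only on $M$ and $\beta$; the dimensions $d$, $N$, and $r$ enter only through the side condition, which is the sense in which the rate is dimension-free.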
Primary Area: learning theory
Submission Number: 12456