Training Dynamics of Softmax Self-Attention: Fast Global Convergence via Preconditioning

Published: 29 May 2026, Last Modified: 29 May 2026
Venue: HiLD at ICML 2026 Poster
License: CC BY 4.0
Keywords: self-attention, global convergence
TL;DR: We propose a new optimization algorithm which quickly finds the globally optimal self-attention parameters in a regression setting.
Abstract: We study the training dynamics of gradient descent in a softmax self-attention layer trained to perform linear regression and propose a first-order optimization algorithm which converges to the globally optimal self-attention parameters at a geometric rate. Our analysis proceeds in two steps. First, we show that in the infinite-data limit the regression problem solved by the self-attention layer is equivalent to a nonconvex matrix factorization problem. Second, we exploit this connection to design a novel ``structure-aware'' variant of gradient descent which efficiently optimizes the original finite-data regression objective. Our optimization algorithm features several innovations over vanilla gradient descent, including a data-dependent preconditioner and a scale-invariant regularizer which help avoid spurious stationary points, a renormalization step which ensures that the softmax parameters remain bounded, and a spectral initialization of parameters which lie near the manifold of global minima with high probability. We prove that the generalization error of the model trained by our algorithm decreases exponentially fast in the number of gradient descent iterations, up to an additional error term that decreases as $1/n$ in the size of the training set.
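To make two of the ingredients named in the abstract concrete (preconditioning and spectral initialization), here is a minimal sketch on a generic low-rank matrix factorization problem. It is a toy illustration under stated assumptions, not the algorithm of the paper: the objective 0.5 * ||L R^T - M||_F^2, the inverse-Gram preconditioner (the one used in scaled gradient descent, swapped in here as a stand-in for the paper's data-dependent preconditioner), the noisy matrix Y used for initialization, and all names (`spectral_init`, `preconditioned_gd`, `step`, `iters`) are hypothetical. The scale-invariant regularizer and the renormalization step are omitted.

```python
import numpy as np

def spectral_init(Y, r):
    """Rank-r factors from the top-r SVD of Y, with the singular values
    split evenly between the two factors (an assumed, common choice)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    root = np.sqrt(s[:r])
    return U[:, :r] * root, Vt[:r].T * root

def preconditioned_gd(M, L, R, step=0.5, iters=60):
    """Gradient descent on 0.5 * ||L R^T - M||_F^2 where each factor's
    gradient is right-multiplied by the inverse Gram matrix of the other
    factor (a stand-in preconditioner, not the paper's)."""
    norm_M = np.linalg.norm(M)
    errs = [np.linalg.norm(L @ R.T - M) / norm_M]
    for _ in range(iters):
        resid = L @ R.T - M
        grad_L = resid @ R        # gradient with respect to L
        grad_R = resid.T @ L      # gradient with respect to R
        # solve(A, B.T).T equals B @ inv(A) for symmetric A, without
        # forming the inverse explicitly.
        L_new = L - step * np.linalg.solve(R.T @ R, grad_L.T).T
        R_new = R - step * np.linalg.solve(L.T @ L, grad_R.T).T
        L, R = L_new, R_new
        errs.append(np.linalg.norm(L @ R.T - M) / norm_M)
    return L, R, errs

# Toy problem: a rank-2 target with condition number 10.
rng = np.random.default_rng(0)
m, n, r = 20, 15, 2
U, _ = np.linalg.qr(rng.standard_normal((m, r)))
V, _ = np.linalg.qr(rng.standard_normal((n, r)))
M = U @ np.diag([10.0, 1.0]) @ V.T

# Initialize from a noisy observation of M, so the starting point is
# near, but not on, the set of exact factorizations.
Y = M + 1e-3 * rng.standard_normal((m, n))
L0, R0 = spectral_init(Y, r)

L, R, errs = preconditioned_gd(M, L0, R0)
print("initial relative error:", errs[0])
print("final relative error:  ", errs[-1])
```

Plotting `errs` on a logarithmic axis should give a roughly straight line until floating-point precision is reached, which is the kind of geometric decay the abstract refers to. In this toy setting the Gram-matrix preconditioner makes the step invariant to rescaling of the factors, so the step size does not need to be tuned to the conditioning of M.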
Submission Number: 153