Inductive Bias and Spectral Properties of Single-Head Attention in High Dimensions

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: High-dimensional statistics; Overparameterization; Attention mechanisms; Inductive bias; Random matrix theory; Approximate message passing; Generalization; Spectral analysis; Compressed sensing
TL;DR: We analyze ERM in high-dimensional attention, deriving learning curves and spectral laws that explain empirical transformer behavior.
Abstract: We study empirical risk minimization in a single-head tied-attention layer trained on synthetic high-dimensional sequence tasks, as captured by the recently introduced attention-indexed model. Using tools from random matrix theory, spin-glass physics, and approximate message passing, we derive sharp asymptotics for training and test errors, locate interpolation and recovery thresholds, and characterize the limiting spectral distribution of the learned weights. We show that weight decay induces an implicit nuclear-norm regularization, favoring low-rank query and key matrices. Leveraging this, we compare the standard factorized training of the query and key matrices with a direct parameterization in which their product is trained element-wise, revealing the inductive bias introduced by the factorized form. Remarkably, the predicted spectral distribution matches empirical trends reported in large-scale transformers, offering a theoretical account consistent with these observations.
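The implicit nuclear-norm effect mentioned in the abstract is presumably an instance of a standard variational identity (the submission materials shown here do not spell it out; this is a sketch of the likely mechanism): for a combined attention matrix W = QK^T, Frobenius-norm weight decay on the factors Q and K equals, at the optimum, a nuclear-norm penalty on W, provided the inner (head) dimension is at least rank(W):

\|W\|_{*} \;=\; \min_{Q,\,K \,:\, QK^{\top} = W} \tfrac{1}{2}\left(\|Q\|_F^{2} + \|K\|_F^{2}\right).

Since the nuclear norm is the convex surrogate for rank, this would explain why factorized training with weight decay favors low-rank query and key matrices, whereas element-wise training of W with weight decay penalizes the Frobenius norm of W instead and carries no such low-rank bias.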
Supplementary Material: zip
Primary Area: learning theory
Submission Number: 11237