Keywords: Transformers, Efficient Attention, Attention Mechanisms, Low-Rank Approximation, Nyström Method, Theoretical Machine Learning, Geometric Deep Learning, Subspace Clustering
TL;DR: We prove that selecting attention landmarks via geometric clustering (k-means) yields a provably tighter low-rank approximation than uniform random sampling.
Abstract: Nystr{\"o}m-based approximation is a prominent strategy for achieving linear-time self-attention, yet its standard reliance on uniform random sampling is often misaligned with the non-uniform spectral properties of learned token embeddings. This work provides a rigorous basis for a data-aware, geometric sampling strategy that directly exploits this structure. We introduce and formalize \textit{block-coherence}, a spectral property of matrices where statistical leverage is concentrated within discoverable clusters. We then prove our main theoretical result: for matrices exhibiting this property, landmark selection via \textit{k-means clustering} achieves a provably tighter Frobenius norm approximation bound than uniform sampling. Our proof establishes a formal connection between the variance-minimizing k-means objective and the concentration of leverage scores, showing that k-means acts as an effective proxy for adaptive importance sampling. A multi-tiered empirical study validates our theory. We first verify that block-coherence is a consistent, emergent property of diverse architectures (BERT, Llama, ViT). We then demonstrate that this structure yields a 25-35\% reduction in Nystr{\"o}m reconstruction error over random sampling. Finally, our algorithmic realization, \textit{Geometric Progressive Attention (GPA)}, achieves state-of-the-art performance among efficient methods on the Long Range Arena (LRA) benchmark, demonstrating that superior approximation quality translates directly to improved downstream performance.
Primary Area: learning theory
Submission Number: 20651