Keywords: sparse attention, kernel regression, compact kernels, Nadaraya-Watson estimator, nonparametric density estimation, transformers
Abstract: Recent work has revealed a link between self-attention in transformers and test-time kernel regression via the Nadaraya–Watson estimator, with standard softmax attention corresponding to a Gaussian kernel. However, a kernel-theoretic understanding of sparse attention mechanisms is currently missing. Motivated by neuroscience, where sparse hippocampal activity suggests memory retrieval is driven by only a small number of highly weighted similarities among well-separated memories, we develop a formal correspondence between sparse attention and compact kernels. We show that normalized ReLU and sparsemax attention arise from Epanechnikov kernel regression under fixed and adaptive normalizations, respectively. More generally, widely used kernels in nonparametric density estimation—including Epanechnikov, biweight, and triweight—correspond to $\alpha$-entmax attention with $\alpha = 1 + \frac{1}{n}$ for $n \in \mathbb{N}$, while the softmax/Gaussian relationship emerges as $n \to \infty$. This unified perspective explains how sparsity naturally arises from kernel design and provides principled alternatives to heuristic top-$k$ attention and other associative memory mechanisms. Experiments with sparse kernel-regression-based variants of transformers—Memory Mosaics—show competitive performance on language modeling, in-context learning, and length generalization tasks.
Submission Number: 10
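The correspondence stated in the abstract between Epanechnikov kernel regression and normalized ReLU / sparsemax attention can be illustrated with a small numerical sketch. It is not the authors' implementation. It assumes unit-norm queries and keys, so that $\|q - k\|^2 = 2 - 2\,q \cdot k$, and the function names (`epanechnikov_weights`, `normalized_relu_weights`, `sparsemax`) and the bandwidth parameter `h` are hypothetical names chosen for illustration. Under that assumption the Epanechnikov weight $\max(0, 1 - \|q - k\|^2 / h^2)$ is proportional to $\max(0, q \cdot k - \theta)$ with $\theta = 1 - h^2/2$, which is a ReLU with a fixed threshold. Sparsemax has the same truncated-linear form, but chooses the threshold per query so that the weights sum to one without a division.

```python
import numpy as np


def epanechnikov_weights(q, keys, h):
    """Nadaraya-Watson weights for the Epanechnikov kernel max(0, 1 - d^2 / h^2).

    Assumes at least one key lies within distance h of q, so the sum is nonzero.
    """
    d2 = np.sum((keys - q) ** 2, axis=1)
    w = np.maximum(0.0, 1.0 - d2 / h**2)
    return w / w.sum()


def normalized_relu_weights(q, keys, h):
    """The same weights written as a normalized ReLU of dot-product scores.

    Valid only for unit-norm q and keys (illustrative assumption), where
    ||q - k||^2 = 2 - 2 q.k, giving the fixed threshold theta = 1 - h^2 / 2.
    """
    scores = keys @ q
    theta = 1.0 - h**2 / 2.0
    w = np.maximum(0.0, scores - theta)
    return w / w.sum()


def sparsemax(z):
    """Sparsemax: p_i = max(0, z_i - tau), with tau chosen so that sum(p) = 1.

    Returns the weights and the data-dependent threshold tau.
    """
    z_sorted = np.sort(z)[::-1]                 # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, len(z) + 1)
    in_support = 1.0 + k * z_sorted > cumsum    # entries that stay nonzero
    k_max = k[in_support][-1]                   # size of the support
    tau = (cumsum[k_max - 1] - 1.0) / k_max
    return np.maximum(0.0, z - tau), tau


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.normal(size=(6, 4))
    keys /= np.linalg.norm(keys, axis=1, keepdims=True)  # unit-norm keys
    q = keys[0]                                          # unit-norm query

    print("Epanechnikov NW weights:", epanechnikov_weights(q, keys, h=1.2))
    print("Normalized ReLU weights:", normalized_relu_weights(q, keys, h=1.2))

    p, tau = sparsemax(keys @ q)
    print("Sparsemax weights:", p, "threshold:", tau)
```

In this sketch the fixed threshold `theta` plays the role of the fixed normalization and the per-query `tau` the role of the adaptive one. Raising the truncated-linear term to the power $n$ would give the biweight ($n = 2$) and triweight ($n = 3$) shapes that the abstract associates with $\alpha$-entmax at $\alpha = 1 + \frac{1}{n}$.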