Abstract: Transformers dominate NLP, yet their core component, self-attention, remains largely heuristic and lacks a robust theoretical foundation. This paper reinterprets self-attention with rotary positional embeddings (RoPE) as Nadaraya-Watson kernel regression, opening a principled framework for enhancing attention through kernel modeling. We introduce Gaussian Process Attention (GPA), which augments RoPE with a bank of decaying periodic kernels to capture linguistic patterns such as periodicity and decay. Evaluated on a GPT model with character-level tokenization and a 13-million-character corpus, GPA outperforms the RoPE baseline, reducing mean cross-entropy loss. The learned kernel banks also enable mechanistic interpretability, revealing linguistic structures, such as paragraph lengths, and identifying redundant attention heads for model pruning. With only a few additional parameters, GPA improves efficiency, for example through head pruning, without sacrificing performance. Our work bridges kernel methods and Transformers, providing a theoretical lens on attention while delivering practical gains in performance and interpretability. It paves the way for scalable, interpretable NLP models, with implications for optimizing large-scale Transformers and understanding their inner workings.
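The abstract describes GPA as augmenting RoPE attention with a bank of decaying periodic kernels. Below is a minimal PyTorch sketch of how such a kernel bank could be realized as a learned relative-position bias on the attention logits; the exponentially decaying cosine form, the parameter names, and the additive placement before the softmax are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): a bank of decaying periodic kernels
# k_m(d) = a_m * exp(-|d| / lambda_m) * cos(omega_m * d), summed over m into a
# per-head bias on attention logits as a function of relative position d = i - j.
import torch
import torch.nn as nn


class DecayingPeriodicKernelBank(nn.Module):
    def __init__(self, num_heads: int, num_kernels: int = 4):
        super().__init__()
        # Learnable amplitude, decay length, and angular frequency per head and kernel.
        # Amplitudes start at zero so the bias is initially a no-op on top of RoPE.
        self.amplitude = nn.Parameter(torch.zeros(num_heads, num_kernels))
        self.log_decay = nn.Parameter(torch.zeros(num_heads, num_kernels))
        self.omega = nn.Parameter(torch.rand(num_heads, num_kernels))

    def forward(self, seq_len: int) -> torch.Tensor:
        # Relative positions d = i - j, shape (T, T), broadcast against (H, K, 1, 1) params.
        pos = torch.arange(seq_len, device=self.amplitude.device)
        d = (pos[:, None] - pos[None, :]).float()[None, None]   # (1, 1, T, T)
        amp = self.amplitude[..., None, None]                    # (H, K, 1, 1)
        lam = self.log_decay.exp()[..., None, None]
        om = self.omega[..., None, None]
        k = amp * torch.exp(-d.abs() / lam) * torch.cos(om * d)  # (H, K, T, T)
        return k.sum(dim=1)                                       # (H, T, T)


# Hypothetical usage: add the kernel-bank bias to RoPE attention scores before
# the softmax, so the normalized weights act as a Nadaraya-Watson style kernel
# over the value vectors.
#   scores = (q_rope @ k_rope.transpose(-2, -1)) / head_dim**0.5  # (B, H, T, T)
#   scores = scores + kernel_bank(T)                              # broadcast over batch
#   attn = scores.masked_fill(causal_mask, float("-inf")).softmax(dim=-1)
```

This adds only a handful of parameters per head (three per kernel in this sketch), consistent with the abstract's claim that GPA introduces very few additional parameters.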
Paper Type: Short
Research Area: Machine Learning for NLP
Research Area Keywords: Generative Models, Representation Learning, Generalization
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 1123