Abstract: Transformers dominate NLP, yet their core component, self-attention, remains largely heuristic and lacks a robust theoretical foundation. This paper reinterprets self-attention with rotary positional embeddings (RoPE) as Nadaraya-Watson kernel regression, opening a principled framework for enhancing attention through kernel modeling. We introduce Gaussian Process Attention (GPA), which augments RoPE with a bank of decaying periodic kernels to capture linguistic patterns such as periodicity and decay. Evaluated on a GPT model with character-level tokenization and a 13-million-character corpus, GPA outperforms the RoPE baseline, reducing mean cross-entropy loss. The learned kernel banks also enable mechanistic interpretability, revealing linguistic structures, such as paragraph lengths, and identifying redundant attention heads for model pruning. With only a few additional parameters, GPA improves efficiency, for example through head pruning, without sacrificing performance. Our work bridges kernel methods and Transformers, providing a theoretical lens on attention while delivering practical gains in performance and interpretability. It paves the way for scalable, interpretable NLP models, with implications for optimizing large-scale Transformers and understanding their inner workings.
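The abstract describes GPA as augmenting RoPE attention with a bank of decaying periodic kernels. Below is a minimal PyTorch sketch of how such a kernel bank could be realized as a learned relative-position bias on the attention logits; the exponentially decaying cosine form, the parameter names, and the additive placement before the softmax are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the paper's code): a bank of decaying periodic kernels
# k_m(d) = a_m * exp(-|d| / lambda_m) * cos(omega_m * d), summed over m into a
# per-head bias on attention logits as a function of relative position d = i - j.
import torch
import torch.nn as nn


class DecayingPeriodicKernelBank(nn.Module):
    def __init__(self, num_heads: int, num_kernels: int = 4):
        super().__init__()
        # Learnable amplitude, decay length, and angular frequency per head and kernel.
        # Amplitudes start at zero so the bias is initially a no-op on top of RoPE.
        self.amplitude = nn.Parameter(torch.zeros(num_heads, num_kernels))
        self.log_decay = nn.Parameter(torch.zeros(num_heads, num_kernels))
        self.omega = nn.Parameter(torch.rand(num_heads, num_kernels))

    def forward(self, seq_len: int) -> torch.Tensor:
        # Relative positions d = i - j, shape (T, T), broadcast against (H, K, 1, 1) params.
        pos = torch.arange(seq_len, device=self.amplitude.device)
        d = (pos[:, None] - pos[None, :]).float()[None, None]   # (1, 1, T, T)
        amp = self.amplitude[..., None, None]                    # (H, K, 1, 1)
        lam = self.log_decay.exp()[..., None, None]
        om = self.omega[..., None, None]
        k = amp * torch.exp(-d.abs() / lam) * torch.cos(om * d)  # (H, K, T, T)
        return k.sum(dim=1)                                       # (H, T, T)


# Hypothetical usage: add the kernel-bank bias to RoPE attention scores before
# the softmax, so the normalized weights act as a Nadaraya-Watson style kernel
# over the value vectors.
#   scores = (q_rope @ k_rope.transpose(-2, -1)) / head_dim**0.5  # (B, H, T, T)
#   scores = scores + kernel_bank(T)                              # broadcast over batch
#   attn = scores.masked_fill(causal_mask, float("-inf")).softmax(dim=-1)
```

This adds only a handful of parameters per head (three per kernel in this sketch), consistent with the abstract's claim that GPA introduces very few additional parameters.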
Paper Type: Short
Research Area: Machine Learning for NLP
Research Area Keywords: Generative Models, Representation Learning, Generalization
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 1123