A Theoretical Study of (Hyper) Self-Attention through the Lens of Interactions: Representation, Training, Generalization
TL;DR: We treat self-attention as an interaction learner, prove that a single linear self-attention layer can capture all pairwise dependencies, and then introduce HyperFeatureAttention and HyperAttention to capture richer interactions.
Abstract: Self-attention has emerged as a core component of modern neural architectures, yet its theoretical underpinnings remain elusive. In this paper, we study self-attention through the lens of *interacting entities*, ranging from agents in multi-agent reinforcement learning to alleles in genetic sequences, and show that a single-layer linear self-attention can *efficiently* represent, learn, and generalize functions capturing pairwise interactions, including out-of-distribution scenarios. Our analysis reveals that self-attention acts as a *mutual interaction learner* under minimal assumptions on the diversity of interaction patterns observed during training, thereby encompassing a wide variety of real-world domains. In addition, we validate our theoretical insights through experiments demonstrating that self-attention learns interaction functions and generalizes across both population distributions and out-of-distribution scenarios. Building on our theory, we introduce *HyperFeatureAttention*, a novel neural network module designed to learn couplings of different feature-level interactions between entities. Furthermore, we propose *HyperAttention*, a new module that extends beyond pairwise interactions to capture multi-entity dependencies, such as three-way, four-way, or general $n$-way interactions.
Lay Summary: Modern AI models such as ChatGPT rely on a mechanism called attention, which lets every word (or image patch, protein residue, or robot agent) decide how strongly it should “listen” to all the others. Despite its success, we still lack a clear, mathematical picture of why this mechanism works so well. Our study views each word or agent as an interacting entity and proves that, under some assumptions, even a single simplified attention layer can efficiently capture pairwise relationships in the data. We further show that ordinary training methods reliably reach these ideal parameters and that the resulting model naturally handles entirely new data and even much longer sequences than it saw during training. Put simply, a self-attention block can serve as a near-perfect mutual interaction learner. Building on these insights, we introduce two new attention blocks: HyperFeatureAttention, which learns couplings of feature-level interactions, and HyperAttention, which learns higher-order interactions (three-way, four-way, and general n-way). Toy language-model experiments confirm the advantages of these richer blocks. By revealing how attention learns interactions and how to extend it, our work lays a foundation for more efficient, trustworthy AI systems in areas ranging from multi-agent control to genomics.
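To make the "pairwise interaction learner" view concrete, here is a minimal NumPy sketch of single-layer *linear* self-attention (softmax removed), the simplified block the abstract refers to. All dimensions and weight matrices (`W_Q`, `W_K`, `W_V`) here are illustrative assumptions, not the paper's exact construction; the sketch only shows that each output row decomposes into a sum of pairwise terms over all other entities.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 4  # number of entities (sequence length), embedding dimension

X = rng.normal(size=(n, d))    # one embedding per interacting entity
W_Q = rng.normal(size=(d, d))  # hypothetical query weights
W_K = rng.normal(size=(d, d))  # hypothetical key weights
W_V = rng.normal(size=(d, d))  # hypothetical value weights

# Linear self-attention: raw pairwise scores, no softmax.
scores = (X @ W_Q) @ (X @ W_K).T  # (n, n) matrix of pairwise scores
out = scores @ (X @ W_V)          # each output row mixes all entities

# The output for entity i is exactly a sum of pairwise interaction
# terms f(x_i, x_j) over all entities j -- the "mutual interaction
# learner" decomposition.
i = 2
pairwise_sum = sum(
    (X[i] @ W_Q @ W_K.T @ X[j]) * (X[j] @ W_V) for j in range(n)
)
assert np.allclose(out[i], pairwise_sum)
```

The decomposition holds for any weight choice; the paper's claim concerns which pairwise interaction functions such weights can represent and whether training finds them.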
Primary Area: Theory->Deep Learning
Keywords: attention, self-attention, deep learning, theory, representation, convergence, generalization, interactions
Submission Number: 11420