Attention is Not Always Needed: Attention Sink Forges a Native MoE in Attention Layers

ICLR 2026 Conference Submission 12918 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Attention Sink, Gated Attention, Large Language Models
Abstract: Large Language Models (LLMs) often assign disproportionate attention to the first token, a phenomenon known as the attention sink. Several recent approaches aim to address this issue, including Sink Attention in GPT-OSS and Gated Attention in Qwen3-Next. However, a comprehensive analysis of the attention sink is lacking, and the necessity of eliminating it remains unclear. In this work, we provide both theoretical and empirical evidence showing that the attention sink emerges as a mechanism to resolve forced attention, yet it also limits the model's expressive capacity. By analyzing the connection between the attention sink and Gated Attention, we demonstrate that the attention sink implicitly constructs a native Mixture-of-Experts (MoE) within attention layers. This insight reveals why only a fixed subset of attention heads contributes to generation, which closely resembles the expert collapse problem encountered in MoE. To improve the utilization balance of attention heads, we propose a sink-aware training algorithm with an auxiliary load-balancing loss designed for attention layers. We hope this study offers a new practical perspective on the attention sink and Gated Attention, and encourages further exploration of how to leverage the inherent MoE mechanisms within attention layers.
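
The abstract does not specify the form of the auxiliary load-balancing loss; the sketch below is only an illustration of the general idea under one plausible assumption: per-head "utilization" is measured as the attention mass a head places on non-sink (non-first) positions, and imbalance across heads is penalized in the spirit of MoE auxiliary losses. The function names and the coefficient-of-variation penalty are hypothetical choices, not the authors' method.

```python
# Minimal sketch (not the paper's implementation) of an auxiliary
# load-balancing loss over attention heads.
# Assumption: head utilization = average post-softmax attention mass
# assigned to tokens other than the first (sink) position.
import torch


def head_utilization(attn_probs: torch.Tensor) -> torch.Tensor:
    """attn_probs: (batch, heads, query_len, key_len) post-softmax weights.
    Returns per-head utilization in [0, 1]."""
    non_sink_mass = attn_probs[..., 1:].sum(dim=-1)  # (batch, heads, query_len)
    return non_sink_mass.mean(dim=(0, 2))            # (heads,)


def load_balance_loss(attn_probs: torch.Tensor) -> torch.Tensor:
    """Penalize imbalanced head utilization via the squared coefficient
    of variation (a common MoE-style balancing surrogate)."""
    util = head_utilization(attn_probs)
    return util.var(unbiased=False) / (util.mean() ** 2 + 1e-8)


# Usage sketch: add the auxiliary term to the language-modeling loss,
# summed over layers and scaled by a small coefficient (hypothetical value).
# total_loss = lm_loss + 0.01 * sum(load_balance_loss(p) for p in attn_probs_per_layer)
```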
Primary Area: foundation or frontier models, including LLMs
Submission Number: 12918