Toward Linearly Regularizing the Geometric Bottleneck of Linear Generalized Attention

TMLR Paper 4675 Authors

15 Apr 2025 (modified: 28 May 2025) · Under review for TMLR · CC BY 4.0
Abstract: Transformers excel across domains, yet full self-attention incurs a prohibitive $\mathcal{O}(n^2)$ cost for sequences of length $n$. Existing \textit{efficient} attention methods either restrict the attention pattern (local/sparse attention) or approximate the softmax kernel, each with drawbacks: the former suffers from attention bottlenecks (over-squashing of long-range dependencies) and precludes the use of global tokens in autoregressive tasks, while the latter often requires sequential processing and can lose accuracy when the kernel approximation falls short. In this work, we introduce a novel attention mechanism, \textit{Bottleneck Regularized Linear Attention (BRL-Attention)}, which unites the strengths of pattern-based and kernel-based techniques to enable efficient, global information flow at linear complexity. BRL-Attention extends a local attention pattern with a small set of compressed tokens that serve as a global information reservoir, capturing long-range interactions without quadratic cost. This bottleneck regularization strategy alleviates the geometric attention bottleneck while retaining full expressiveness; that is, it matches the sequence-modeling capacity of full softmax attention while mitigating over-squashing across layers. Moreover, it integrates global tokens without breaking causal masking, making it applicable to both encoder-only and autoregressive decoder architectures. Extensive experiments on long-sequence and graph benchmarks show that BRL-Attention matches or exceeds the predictive performance of standard Transformers with full attention while substantially reducing memory usage and computation time. These results underscore its potential as a scalable, drop-in replacement for existing attention mechanisms.
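To make the abstract's central idea concrete, the sketch below illustrates a local attention pattern augmented with a small set of compressed tokens acting as a global reservoir, which keeps the per-token cost at $\mathcal{O}(w + m)$ and hence the total cost linear in $n$. This is a minimal, non-causal (encoder-style) illustration under assumptions of our own — mean-pooled chunks as the compressed tokens and the hypothetical parameters `window` and `num_global` — and is not the paper's BRL-Attention implementation; the causal variant described in the abstract would additionally require restricting each query to compressed tokens built from past positions only.

```python
import torch
import torch.nn.functional as F


def local_plus_global_attention(q, k, v, window=64, num_global=8):
    """Toy local-window attention with a few compressed global tokens.

    q, k, v: tensors of shape (batch, seq_len, dim).
    Returns a tensor of shape (batch, seq_len, dim).
    Cost is O(seq_len * (window + num_global) * dim), i.e. linear in seq_len.
    """
    b, n, d = q.shape

    # Compressed "global" tokens: mean-pool keys/values over num_global
    # equal-sized chunks (an illustrative choice, not the paper's method).
    chunk = -(-n // num_global)            # ceil division
    pad = chunk * num_global - n
    k_pad = F.pad(k, (0, 0, 0, pad))       # pad along the sequence dimension
    v_pad = F.pad(v, (0, 0, 0, pad))
    k_glob = k_pad.reshape(b, num_global, chunk, d).mean(dim=2)   # (b, m, d)
    v_glob = v_pad.reshape(b, num_global, chunk, d).mean(dim=2)   # (b, m, d)

    out = torch.empty_like(v)
    for i in range(n):                     # per-token loop keeps the sketch simple
        lo, hi = max(0, i - window), min(n, i + window + 1)
        k_all = torch.cat([k[:, lo:hi], k_glob], dim=1)   # local keys + global reservoir
        v_all = torch.cat([v[:, lo:hi], v_glob], dim=1)
        scores = torch.einsum('bd,bkd->bk', q[:, i], k_all) / d ** 0.5
        out[:, i] = torch.einsum('bk,bkd->bd', scores.softmax(dim=-1), v_all)
    return out


# Usage example: a batch of 2 sequences of length 1024 with dimension 32.
q = torch.randn(2, 1024, 32)
k = torch.randn(2, 1024, 32)
v = torch.randn(2, 1024, 32)
print(local_plus_global_attention(q, k, v).shape)   # torch.Size([2, 1024, 32])
```

The design point this sketch highlights is the one made in the abstract: every query attends to a bounded neighborhood plus a fixed-size set of compressed tokens, so long-range information can flow through the compressed reservoir without any query ever attending to all $n$ positions.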
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Shuangfei_Zhai3
Submission Number: 4675