Toward Linearly Regularizing the Geometric Bottleneck of Linear Generalized Attention

TMLR Paper 4675 Authors

15 Apr 2025 (modified: 28 May 2025) · Under review for TMLR · CC BY 4.0
Abstract: Transformers excel across domains, yet full self-attention incurs a prohibitive $\mathcal{O}(n^2)$ cost for sequences of length $n$. Existing \textit{efficient} attention methods either restrict the attention pattern (local/sparse attention) or approximate the softmax kernel, each with drawbacks: the former suffers from attention bottlenecks (over-squashing of long-range dependencies) and precludes the use of global tokens in autoregressive tasks, while the latter often requires sequential processing and can lose accuracy when the kernel approximation falls short. In this work, we introduce a novel attention mechanism, \textit{Bottleneck Regularized Linear Attention (BRL-Attention)}, which unites the strengths of pattern-based and kernel-based techniques to enable efficient, global information flow at linear complexity. BRL-Attention extends a local attention pattern with a small set of compressed tokens that serve as a global information reservoir, capturing long-range interactions without quadratic cost. This bottleneck regularization strategy alleviates the geometric attention bottleneck while retaining full expressiveness; that is, it matches the sequence-modeling capacity of full softmax attention while mitigating over-squashing across layers. Moreover, it integrates global tokens without breaking causal masking, making it applicable to both encoder-only and autoregressive decoder architectures. Extensive experiments on long-sequence and graph benchmarks show that BRL-Attention matches or exceeds the predictive performance of standard Transformers with full attention while substantially reducing memory usage and computation time. These results underscore its potential as a scalable, drop-in replacement for existing attention mechanisms.
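To make the abstract's central idea concrete, the sketch below illustrates a local attention pattern augmented with a small set of compressed tokens acting as a global reservoir, which keeps the per-token cost at $\mathcal{O}(w + m)$ and hence the total cost linear in $n$. This is a minimal, non-causal (encoder-style) illustration under assumptions of our own — mean-pooled chunks as the compressed tokens and the hypothetical parameters `window` and `num_global` — and is not the paper's BRL-Attention implementation; the causal variant described in the abstract would additionally require restricting each query to compressed tokens built from past positions only.

```python
import torch
import torch.nn.functional as F


def local_plus_global_attention(q, k, v, window=64, num_global=8):
    """Toy local-window attention with a few compressed global tokens.

    q, k, v: tensors of shape (batch, seq_len, dim).
    Returns a tensor of shape (batch, seq_len, dim).
    Cost is O(seq_len * (window + num_global) * dim), i.e. linear in seq_len.
    """
    b, n, d = q.shape

    # Compressed "global" tokens: mean-pool keys/values over num_global
    # equal-sized chunks (an illustrative choice, not the paper's method).
    chunk = -(-n // num_global)            # ceil division
    pad = chunk * num_global - n
    k_pad = F.pad(k, (0, 0, 0, pad))       # pad along the sequence dimension
    v_pad = F.pad(v, (0, 0, 0, pad))
    k_glob = k_pad.reshape(b, num_global, chunk, d).mean(dim=2)   # (b, m, d)
    v_glob = v_pad.reshape(b, num_global, chunk, d).mean(dim=2)   # (b, m, d)

    out = torch.empty_like(v)
    for i in range(n):                     # per-token loop keeps the sketch simple
        lo, hi = max(0, i - window), min(n, i + window + 1)
        k_all = torch.cat([k[:, lo:hi], k_glob], dim=1)   # local keys + global reservoir
        v_all = torch.cat([v[:, lo:hi], v_glob], dim=1)
        scores = torch.einsum('bd,bkd->bk', q[:, i], k_all) / d ** 0.5
        out[:, i] = torch.einsum('bk,bkd->bd', scores.softmax(dim=-1), v_all)
    return out


# Usage example: a batch of 2 sequences of length 1024 with dimension 32.
q = torch.randn(2, 1024, 32)
k = torch.randn(2, 1024, 32)
v = torch.randn(2, 1024, 32)
print(local_plus_global_attention(q, k, v).shape)   # torch.Size([2, 1024, 32])
```

The design point this sketch highlights is the one made in the abstract: every query attends to a bounded neighborhood plus a fixed-size set of compressed tokens, so long-range information can flow through the compressed reservoir without any query ever attending to all $n$ positions.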
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Shuangfei_Zhai3
Submission Number: 4675