RSA: Recursive Sparse Attention with Hierarchical Deep–Shallow Memory and Sparse Activation

22 Jan 2026 (modified: 24 Jun 2026), Submitted to ICML 2026, CC BY 4.0
Abstract: Linear sequence models offer strong efficiency advantages for long-context modeling because their cost scales linearly with sequence length. However, they still lag behind standard Transformers on many tasks, especially those requiring long-range reasoning and retrieval. Recent work has explored gating mechanisms, delta rules, and multi-state designs to narrow this gap. Despite these efforts, most existing methods primarily capture short-range information, which limits their effectiveness on complex long-horizon tasks. Fundamentally, this limitation stems from insufficient information utilization and constrained effective memory capacity. Motivated by insights from neuroscience, we introduce a shallow–deep memory architecture in which multiple memory states are connected and superposed in a structured manner. The shallow memory captures coarse-grained representations, while the deep memory stores residual information. We show that this design can theoretically match the storage capacity of standard attention. Furthermore, we adopt a sparse-attention-like readout mechanism that enhances attention concentration. By jointly designing the memory storage and retrieval processes, we propose Recursive Sparse Attention (RSA), a novel attention mechanism that bridges the gap between linear models and standard attention. RSA establishes a principled foundation for linear architectures to approach, and potentially surpass, the expressive power of full attention. Empirically, language models built upon RSA achieve performance comparable to or exceeding that of Gated DeltaNet, RWKV-7, and Transformer baselines across a diverse set of language benchmarks.
Primary Area: Deep Learning->Attention Mechanisms
Keywords: Sparse Attention, Linear Attention, High-Efficiency Sequence Modelling
Submission Number: 16685