Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
Keywords: linear attention, hybrid attention
TL;DR: We enhance the retrieval performance of subquadratic-time, constant-space models by hybridizing linear attention with query-aware sparse attention or with our novel learnable token-eviction network, which evicts KV-cache entries based on past and future local context.
Abstract: Linear-attention models that compress the entire past into a fixed-size recurrent state offer an efficient alternative to Transformers, but this finite memory induces forgetfulness that harms performance in retrieval-intensive scenarios.
To mitigate this issue, we explore a series of hybrid memory architectures that restore direct access to the past. We interleave layers whose time and space complexity lies between that of linear and full attention, including sparse attention with token eviction and query-aware native sparse attention. In particular, we propose a novel learnable eviction policy for past memory: combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from both past and future adjacent context to adaptively retain a limited set of critical KV pairs per head, maintaining linear attention's constant time and space complexity. We provide efficient Triton kernels for the sparse attention. Empirical evaluations on retrieval-intensive benchmarks support the effectiveness of our approaches.
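The following is a minimal sketch, not the authors' implementation, of the contextualized eviction idea the abstract describes: a depthwise 1-D CNN with symmetric padding scores each cached KV pair from both its past and future neighbors, and only the top-k pairs per head are retained. All names and hyperparameters (`LearnableTokenEviction`, `keep`, the kernel size) are illustrative assumptions, and the mechanism that makes the selection differentiable for end-to-end training is omitted.

```python
# Hypothetical sketch of contextualized learnable token eviction.
# A per-head 1-D convolution scores cached keys using a symmetric
# window (past *and* future context); only the top-k KV pairs per
# head survive, keeping the retained cache size constant.
import torch
import torch.nn as nn

class LearnableTokenEviction(nn.Module):
    def __init__(self, num_heads: int, head_dim: int,
                 kernel_size: int = 5, keep: int = 64):
        super().__init__()
        self.keep = keep
        # groups=num_heads gives one depthwise scoring filter per head;
        # symmetric padding lets each position see both neighbors' sides.
        self.scorer = nn.Conv1d(
            in_channels=num_heads * head_dim,
            out_channels=num_heads,
            kernel_size=kernel_size,
            padding=kernel_size // 2,
            groups=num_heads,
        )

    def forward(self, k: torch.Tensor, v: torch.Tensor):
        # k, v: (batch, num_heads, seq_len, head_dim)
        b, h, t, d = k.shape
        feats = k.permute(0, 1, 3, 2).reshape(b, h * d, t)  # (B, H*D, T)
        scores = self.scorer(feats)                         # (B, H, T)
        keep = min(self.keep, t)
        # Keep the top-scoring positions per head, re-sorted so the
        # retained KV pairs stay in temporal order.
        idx = scores.topk(keep, dim=-1).indices.sort(-1).values
        gather = idx.unsqueeze(-1).expand(b, h, keep, d)
        return k.gather(2, gather), v.gather(2, gather)

# Example: evict a 256-token cache down to 64 KV pairs per head.
evict = LearnableTokenEviction(num_heads=8, head_dim=32)
k = torch.randn(2, 8, 256, 32)
v = torch.randn(2, 8, 256, 32)
k_kept, v_kept = evict(k, v)
print(k_kept.shape)  # torch.Size([2, 8, 64, 32])
```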
Submission Number: 65