Alleviating Forgetfulness of Linear Attention by Hybrid Sparse Attention and Contextualized Learnable Token Eviction
Keywords: linear attention, hybrid attention
TL;DR: We enhance the retrieval performance of subquadratic-time, constant-space models by hybridizing linear attention with query-aware sparse attention or with our novel learnable token-eviction network, which evicts KV-cache entries based on past and future local context.
Abstract: Linear-attention models that compress the entire past into a fixed-size recurrent state offer an efficient alternative to Transformers, but this finite memory induces forgetfulness that harms performance in retrieval-intensive scenarios.
To mitigate this issue, we explore a series of hybrid memory architectures that restore direct access to the past. We interleave layers whose time and space complexity lies between that of linear and full attention, including sparse attention with token eviction and query-aware native sparse attention. In particular, we propose a novel learnable eviction policy for past memory: combined with sliding-window attention, an end-to-end trainable lightweight CNN aggregates information from both past and future adjacent context to adaptively retain a limited set of critical KV pairs per head, maintaining linear attention's constant time and space complexity. We provide efficient Triton kernels for the sparse attention. Empirical evaluations on retrieval-intensive benchmarks support the effectiveness of our approaches.
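The following is a minimal sketch, not the authors' implementation, of the contextualized eviction idea the abstract describes: a depthwise 1-D CNN with symmetric padding scores each cached KV pair from both its past and future neighbors, and only the top-k pairs per head are retained. All names and hyperparameters (`LearnableTokenEviction`, `keep`, the kernel size) are illustrative assumptions, and the mechanism that makes the selection differentiable for end-to-end training is omitted.

```python
# Hypothetical sketch of contextualized learnable token eviction.
# A per-head 1-D convolution scores cached keys using a symmetric
# window (past *and* future context); only the top-k KV pairs per
# head survive, keeping the retained cache size constant.
import torch
import torch.nn as nn

class LearnableTokenEviction(nn.Module):
    def __init__(self, num_heads: int, head_dim: int,
                 kernel_size: int = 5, keep: int = 64):
        super().__init__()
        self.keep = keep
        # groups=num_heads gives one depthwise scoring filter per head;
        # symmetric padding lets each position see both neighbors' sides.
        self.scorer = nn.Conv1d(
            in_channels=num_heads * head_dim,
            out_channels=num_heads,
            kernel_size=kernel_size,
            padding=kernel_size // 2,
            groups=num_heads,
        )

    def forward(self, k: torch.Tensor, v: torch.Tensor):
        # k, v: (batch, num_heads, seq_len, head_dim)
        b, h, t, d = k.shape
        feats = k.permute(0, 1, 3, 2).reshape(b, h * d, t)  # (B, H*D, T)
        scores = self.scorer(feats)                         # (B, H, T)
        keep = min(self.keep, t)
        # Keep the top-scoring positions per head, re-sorted so the
        # retained KV pairs stay in temporal order.
        idx = scores.topk(keep, dim=-1).indices.sort(-1).values
        gather = idx.unsqueeze(-1).expand(b, h, keep, d)
        return k.gather(2, gather), v.gather(2, gather)

# Example: evict a 256-token cache down to 64 KV pairs per head.
evict = LearnableTokenEviction(num_heads=8, head_dim=32)
k = torch.randn(2, 8, 256, 32)
v = torch.randn(2, 8, 256, 32)
k_kept, v_kept = evict(k, v)
print(k_kept.shape)  # torch.Size([2, 8, 64, 32])
```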
Submission Number: 65