Keywords: language model, linear attention, long-context retrieval
Abstract: Linear attention models have recently emerged as computationally efficient alternatives to Transformers.
Despite competitive performance on general commonsense tasks, they still struggle to match Transformers on long-context retrieval tasks.
In this work, we re-examine linear attention models from the perspective of memory writing.
We propose that enabling linear attention models to learn selective ignoring is a promising approach to long-context retrieval under fixed memory capacity.
Guided by this principle, we demonstrate how to interpret and intervene in the behavior of linear attention models, thereby revealing the true retrieval capabilities of popular models.
Informed by these observations, we introduce Selective Ignoring Linear Attention (SILA), which incorporates a redesigned memory architecture and a weighted loss training strategy to encourage selective memory writing.
SILA exhibits remarkable long-context retrieval capabilities, achieving 20$\times$ context-length extrapolation on the Passkey Retrieval task and superior memory-utilization efficiency on the Needle-in-a-Haystack benchmark.
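For context on the "memory writing" view invoked in the abstract, a minimal sketch of the standard (ungated) linear-attention recurrence is given below in generic notation; this is not the paper's SILA architecture, only the common baseline formulation that such models build on:
\[
S_t \;=\; S_{t-1} + v_t\, k_t^{\top}, \qquad o_t \;=\; S_t\, q_t,
\]
where $q_t, k_t, v_t \in \mathbb{R}^d$ are the query, key, and value at step $t$, and the matrix state $S_t \in \mathbb{R}^{d \times d}$ has a fixed size regardless of context length. Every token is additively "written" into this bounded state, which is why deciding what not to write becomes the central question under fixed memory capacity.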
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23737