Keywords: language model, linear attention, long-context retrieval
Abstract: Linear attention models have recently emerged as computationally efficient alternatives to Transformers.
Despite competitive performance on general commonsense tasks, they still struggle to match Transformers on long-context retrieval tasks.
In this work, we re-examine linear attention models from the perspective of memory writing.
We propose that enabling linear attention models to learn selective ignoring is a promising approach to long-context retrieval under fixed memory capacity.
Guided by this principle, we demonstrate how to interpret and intervene in the behavior of linear attention models, thereby revealing the true retrieval capabilities of popular models.
Informed by these observations, we introduce Selective Ignoring Linear Attention (SILA), which incorporates a redesigned memory architecture and a weighted loss training strategy to encourage selective memory writing.
SILA exhibits remarkable long-context retrieval capabilities, achieving 20$\times$ context-length extrapolation on the Passkey Retrieval task and superior memory-utilization efficiency on the Needle-in-a-Haystack benchmark.
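For context on the "memory writing" view invoked in the abstract, a minimal sketch of the standard (ungated) linear-attention recurrence is given below in generic notation; this is not the paper's SILA architecture, only the common baseline formulation that such models build on:
\[
S_t \;=\; S_{t-1} + v_t\, k_t^{\top}, \qquad o_t \;=\; S_t\, q_t,
\]
where $q_t, k_t, v_t \in \mathbb{R}^d$ are the query, key, and value at step $t$, and the matrix state $S_t \in \mathbb{R}^{d \times d}$ has a fixed size regardless of context length. Every token is additively "written" into this bounded state, which is why deciding what not to write becomes the central question under fixed memory capacity.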
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23737