Keywords: Linear Attention, Sparse Attention, Subquadratic Architectures
TL;DR: We identify difficult-to-memorize tokens at inference time to improve in-context recall of hybrid linear attention models.
Abstract: The inference cost of transformer-based large language models grows with the context length. This prevents their application to lifelong in-context learning. Linear attention is an efficient alternative that maintains a constant memory footprint, even for arbitrarily long contexts. While this makes it a potential candidate for lifelong learning, it falls short in memory capacity. In this paper, we propose LoLA, a training-free augmentation to linear attention that boosts associative recall. LoLA distributes past key-value pairs from the context into three memory systems: (i) recent pairs in a local sliding-window cache; (ii) difficult-to-memorize pairs in a sparse, global cache; and (iii) generic pairs in the recurrent hidden state of linear attention. We empirically show through ablations that our self-recall error metric is crucial for efficiently managing long-term associative memories. LoLA improves the long-context performance of the base model on the RULER benchmark, raising pass-key retrieval accuracy from 0.6% to 97.4%. This is achieved with a 4.6x smaller cache than Llama-3.1 8B at a 4K context length. LoLA also outperforms other 1B- and 8B-parameter subquadratic models on zero-shot commonsense reasoning tasks.
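To make the three-tier routing in the abstract concrete, here is a minimal, hedged sketch (not the authors' code): it assumes a plain normalized linear-attention readout as the self-recall test, and the function name `route_kv_pairs` and the `window` / `sparse_budget` parameters are illustrative placeholders, not an API from the paper.

```python
import torch

def route_kv_pairs(keys, values, window=64, sparse_budget=32):
    """Toy three-tier KV routing sketch (hypothetical helper, not LoLA's implementation).

    - most recent pairs stay in a sliding-window cache
    - pairs the linear-attention state recalls poorly go to a sparse global cache
    - remaining ("generic") pairs are folded into the recurrent state S
    """
    d_k, d_v = keys.shape[-1], values.shape[-1]
    S = torch.zeros(d_k, d_v)   # recurrent linear-attention state (sum of outer products)
    z = torch.zeros(d_k)        # normalizer; a kernel feature map is omitted for brevity
    sparse_cache, window_cache = [], []

    for k, v in zip(keys, values):
        window_cache.append((k, v))
        if len(window_cache) <= window:
            continue
        k_old, v_old = window_cache.pop(0)  # pair evicted from the local window

        # Self-recall error: how poorly the current state reconstructs v_old from k_old.
        denom = (z @ k_old).clamp(min=1e-6)
        v_hat = (S.T @ k_old) / denom
        err = torch.norm(v_old - v_hat)

        if len(sparse_cache) < sparse_budget:
            sparse_cache.append((err, k_old, v_old))  # keep hard-to-memorize pairs exactly
        else:
            # If this pair is harder than the easiest cached one, swap them;
            # whichever pair loses gets folded into the recurrent state.
            i_min = min(range(len(sparse_cache)), key=lambda i: sparse_cache[i][0])
            if err > sparse_cache[i_min][0]:
                _, k_easy, v_easy = sparse_cache[i_min]
                sparse_cache[i_min] = (err, k_old, v_old)
                k_old, v_old = k_easy, v_easy
            S += torch.outer(k_old, v_old)  # generic pair -> linear-attention state
            z += k_old

    return S, z, sparse_cache, window_cache
```

As a usage sketch, `route_kv_pairs(torch.randn(1024, 64), torch.randn(1024, 64))` would return the recurrent state plus the two caches for a 1024-token context; the key point is that only the self-recall error decides which evicted pairs are kept exactly.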
Primary Area: foundation or frontier models, including LLMs
Submission Number: 21684