LoLA: Low-Rank Linear Attention with Sparse Caching

09 Mar 2026 (modified: 12 May 2026) · Under review for TMLR · CC BY 4.0
Abstract: The per-token cost of transformer inference scales with context length, preventing its application to lifelong in-context learning. Linear attention is an efficient alternative that maintains a constant memory footprint, even at infinite context lengths. While linear attention is a promising candidate for lifelong learning, it falls short in memory capacity. In this paper, we propose LoLA, a training-free augmentation to linear attention that boosts associative recall. LoLA distributes past key-value pairs from context into three memory systems: (i) recent pairs in a local sliding window cache; (ii) difficult-to-memorize pairs in a sparse, global cache; and (iii) generic pairs in the recurrent hidden state of linear attention. We show through ablations that our self-recall error metric is crucial to efficiently managing long-term associative memories. On pass-key retrieval tasks, LoLA improves the base model's performance from 0.6% to 97.4% accuracy. This is achieved with a 4.6x smaller cache than Llama-3.1 8B at 4K context length. LoLA also outperforms other 1B and 8B parameter subquadratic models on zero-shot commonsense reasoning tasks.
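The three-way routing described in the abstract can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' implementation: it assumes an identity feature map for linear attention (hidden state accumulated as a sum of key-value outer products), and the function names (`self_recall_error`, `route_kv`) and threshold `tau` are invented for illustration.

```python
# Hypothetical sketch of LoLA-style three-way KV routing (not the paper's code).
# Assumes linear attention with an identity feature map: state S = sum_t k_t v_t^T.
import numpy as np

def self_recall_error(S, k, v):
    """Error when the linear-attention state S tries to recall v from key k."""
    v_hat = S.T @ k                      # linear-attention readout for key k
    return np.linalg.norm(v_hat - v)

def route_kv(pairs, window=2, tau=0.5, d=4):
    """Distribute (k, v) pairs across three memories (illustrative only)."""
    S = np.zeros((d, d))                 # (iii) recurrent hidden state
    sliding, sparse = [], []             # (i) local window, (ii) global cache
    for k, v in pairs:
        sliding.append((k, v))
        if len(sliding) > window:        # oldest pair leaves the window
            k_old, v_old = sliding.pop(0)
            S_tentative = S + np.outer(k_old, v_old)
            if self_recall_error(S_tentative, k_old, v_old) > tau:
                sparse.append((k_old, v_old))   # hard to memorize: store exactly
            else:
                S = S_tentative                 # generic: fold into hidden state
    return S, sliding, sparse
```

With orthogonal keys the hidden state recalls a folded pair exactly (zero error), so it absorbs the pair; a pair whose key collides with one already in the state produces a large self-recall error and is instead kept verbatim in the sparse cache.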
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Removed a "pareto-front" typo, fixed a typo in Table 12, and added an anonymous repository link to the pseudo-code appendix section.
Assigned Action Editor: ~Antonio_Orvieto3
Submission Number: 7850