Attention and Compression is all you need for Controllably Efficient Language Models

ICLR 2026 Conference Submission 22217 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: efficient architecture, attention, compression, adaptive architecture
Abstract: The quadratic cost of attention in transformers motivated the development of cheap approximations: namely sparse or sliding window attention, convolutions and linear attention. These approximations come with limitations; they drive down in-context recall as memory in the recurrent state and compute decrease. A priori fixing this quality-compute tradeoff in an architecture means being suboptimal: some downstream applications require good in-context recall, while others require lower latency and memory. Further, these approaches require heuristic choices for attention masks, handcrafted and careful recurrent state update rules, or need to be composed with attention layers to create a hybrid architecture that complicate the design. To address this, we propose a simple architecture called the Compress & Attend Transformer (CAT) that decodes each token attending to a chunk of neighbouring tokens and to compressed chunks of the sequence so far. Choosing a chunk size trades off quality for compute and memory. Moreover, CATs can be trained with multiple chunk sizes at once, unlocking control of quality-compute trade-offs directly at test-time without any retraining, all in a single adaptive architecture. On exhaustive evaluations on language modeling, common-sense reasoning, in-context recall and long-context understanding, CATs outperform many existing efficient baselines including the hybrids when inference time and memory matched, and is competitive with the dense transformer in language modeling while being 1.5−3$\times$ faster and requiring 2−9$\times$ lower memory, depending on the chosen chunk size.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 22217