Keywords: long-context language model, efficiency, inference-time method
TL;DR: We propose a method to speed up inference for long-context LMs by leveraging attention scores to selectively attend to input tokens.
Abstract: Processing long-context input imposes a heavy computational burden when deploying large language models. Recently proposed inference-time methods accelerate generation by attending only to the local context. Despite their efficiency gains, these approaches fail to capture all relevant information in the input, showing substantial performance drops on long-context benchmarks. We propose recycled attention, an efficient and effective method that alternates between full-context attention and attention over a subset of input tokens. When performing partial attention, we leverage the attention pattern of a nearby token that has performed full attention and attend only to the top-K most-attended tokens. We evaluate our method on RULER, a suite of tasks designed to comprehensively evaluate long-context abilities, and on long-context language modeling tasks. Applying our inference method to off-the-shelf LLMs achieves speedups comparable to baselines that consider only the local context while improving performance by 2x. We further experiment with continued pre-training of the model with recycled attention to improve the performance-efficiency trade-off.
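A minimal single-head sketch of the alternation described in the abstract, assuming a fixed key/value cache and illustrative names (`recycle_stride`, `top_k`, `generate` are hypothetical, not the paper's API); the real method also updates the cache as new tokens are generated.

```python
import torch
import torch.nn.functional as F

def attend(q, keys, values):
    """Scaled dot-product attention of a single query vector over cached keys/values."""
    scores = (keys @ q) / (q.shape[-1] ** 0.5)   # (num_keys,)
    probs = F.softmax(scores, dim=-1)
    return probs @ values, probs

def generate(queries, k_cache, v_cache, recycle_stride=4, top_k=16):
    """Toy decoding loop: a full-attention step every `recycle_stride` tokens,
    recycled (top-K) attention on the steps in between."""
    topk_idx = None
    outputs = []
    for step, q in enumerate(queries):
        if topk_idx is None or step % recycle_stride == 0:
            # Full-attention step: attend to every cached token, then record
            # the K most-attended positions for the following recycled steps.
            out, probs = attend(q, k_cache, v_cache)
            topk_idx = probs.topk(min(top_k, probs.numel())).indices
        else:
            # Recycled step: restrict the key/value cache to the positions
            # selected at the most recent full-attention step.
            out, _ = attend(q, k_cache[topk_idx], v_cache[topk_idx])
        outputs.append(out)
    return torch.stack(outputs)

# Example: 1024 cached tokens, 8 decoding steps, head dimension 64.
k_cache, v_cache = torch.randn(1024, 64), torch.randn(1024, 64)
print(generate(torch.randn(8, 64), k_cache, v_cache).shape)  # torch.Size([8, 64])
```

Under this sketch, the per-step attention cost on recycled steps drops from the full cache length to `top_k`, which is where the claimed speedup comes from.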
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 9058