TL;DR: To address the issue of static sparsity allocation in hybrid sparse attention, we propose Elastic Attention, which enables the model to automatically adjust its overall sparsity based on the input.
Abstract: The quadratic complexity of standard attention mechanisms poses a significant scalability bottleneck for large language models (LLMs) in long-context scenarios. While hybrid attention strategies that combine sparse and full attention within a single model offer a viable solution, they typically employ static computation ratios (i.e., fixed proportions of sparse versus full attention) and fail to adapt to the varying sparsity sensitivities of downstream tasks during inference. To address this issue, we propose $\textit{\textbf{Elastic Attention}}$, which allows the model to dynamically adjust its overall sparsity based on the input. This is achieved by integrating a lightweight $\textit{\textbf{Attention Router}}$ into the existing pretrained model, which dynamically assigns each attention head to a computation mode. With only 12 hours of training on 8$\times$A800 GPUs, our method enables models to achieve both strong performance and efficient inference. Experiments across three long-context benchmarks on widely used LLMs demonstrate the superiority of our method.
Lay Summary: This paper introduces a new way to make large language models handle very long texts more efficiently. Current AI models often become slow and expensive when processing long documents because they try to pay attention to every word equally. Existing solutions reduce this cost by forcing the model to ignore some information, but they usually use a fixed strategy that cannot adapt to different tasks.
We propose Elastic Attention, a method that allows the model to automatically decide how much information it needs for each input during inference. For simpler tasks like summarization, the model can safely ignore more details and run faster. For more demanding tasks like question answering or reasoning, it can preserve more detailed attention to maintain accuracy. To achieve this, we add a lightweight routing component that dynamically assigns different attention behaviors to different parts of the model without changing the original pretrained model. Our method improves the balance between efficiency and performance while adding very little extra computation. Experiments on multiple long-context benchmarks and several widely used language models show that Elastic Attention consistently achieves stronger results than existing efficient attention methods, especially on long-document reasoning and retrieval tasks. The method can also be trained quickly and deployed efficiently on practical hardware, making long-context AI systems more accessible and cost-effective.
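The routing idea described above can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the router here is assumed to be a single linear score per head over mean-pooled token features, the sparse mode is assumed to be causal sliding-window attention, and all names (`ToyAttentionRouter`, `attention`, `sparsity`, the window size of 4) are hypothetical choices made for illustration only.

```python
import math
import random


def mean_pool(hidden):
    """Average a sequence of token vectors into one feature vector."""
    dim = len(hidden[0])
    return [sum(tok[d] for tok in hidden) / len(hidden) for d in range(dim)]


class ToyAttentionRouter:
    """Hypothetical router: one linear score per head, thresholded at zero."""

    def __init__(self, dim, num_heads, seed=0):
        rng = random.Random(seed)
        # Randomly initialised weights stand in for trained router parameters.
        self.weights = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                        for _ in range(num_heads)]

    def route(self, hidden):
        """Return "full" or "sparse" for each head, given the input tokens."""
        pooled = mean_pool(hidden)
        modes = []
        for w in self.weights:
            logit = sum(wi * xi for wi, xi in zip(w, pooled))
            modes.append("full" if logit > 0 else "sparse")
        return modes


def attention(q, k, v, window=None):
    """Causal softmax attention for a single head.

    window=None attends to every token up to and including the current one
    (full mode); an integer restricts each query to its most recent
    `window` keys (assumed sparse mode).
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        start = 0 if window is None else max(0, i - window + 1)
        scores = [scale * sum(a * b for a, b in zip(qi, k[j]))
                  for j in range(start, i + 1)]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out.append([sum(e * v[start + t][d] for t, e in enumerate(exps)) / total
                    for d in range(len(v[0]))])
    return out


def sparsity(modes):
    """Fraction of heads routed to the sparse mode."""
    return sum(m == "sparse" for m in modes) / len(modes)


if __name__ == "__main__":
    rng = random.Random(1)
    tokens = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(16)]
    router = ToyAttentionRouter(dim=8, num_heads=4)
    modes = router.route(tokens)
    # Toy simplification: every head reuses the raw tokens as q, k, and v.
    outputs = [attention(tokens, tokens, tokens,
                         window=None if m == "full" else 4) for m in modes]
    print(modes, "overall sparsity:", sparsity(modes))
```

Because the per-head decision depends on the pooled input, different inputs can yield different mixes of full and sparse heads, which is the sense in which overall sparsity becomes input-dependent rather than fixed in advance. In this sketch the pretrained attention computation itself is untouched; only the choice of mode per head is added.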
Originally Submitted Supplementary Material: zip
Link To Code: https://github.com/LCM-Lab/Elastic-Attention
Primary Area: Deep Learning->Large Language Models
Keywords: test-time inference, dynamic sparse attention, efficient inference
Originally Submitted PDF: pdf
Submission Number: 13542