RaaS: Reasoning-Aware Attention Sparsity for Efficient LLM Inference with Long Decoding Chains

ACL ARR 2025 February Submission 397 Authors

07 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) have demonstrated significant capabilities across various domains, with recent advancements in challenging reasoning tasks such as mathematics and programming. However, solving these tasks often requires long decoding chains (of thoughts), which incur $O(N)$ time and memory consumption, where $N$ is the chain length. For instance, in the Llama 3.1 8B model, such chains can reach 128k tokens and produce 16GB of intermediate data, the key-value (KV) cache, for a single request. To mitigate this high time and memory consumption, sparsity-based algorithms propose retaining the KV of only the most critical tokens and discarding the rest. However, existing algorithms struggle with the ``impossible trinity'' of accuracy, time, and memory. For example, the state-of-the-art algorithm, Quest, achieves high accuracy and $O(L)$ time but requires $O(N)$ memory, where $L$ is the cache budget and $L \ll N$. In this paper, we identify a new attention pattern during the decoding stage of reasoning tasks, where milestone tokens, analogous to lemmas in mathematical proofs, emerge, are utilized, and then become unimportant. Since information is condensed into these milestone tokens, we propose RaaS, an algorithm that identifies and retains milestone tokens until they are no longer needed, achieving high accuracy with $O(L)$ time and $O(L)$ memory complexity.
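For intuition, the 16GB figure is consistent with a standard fp16 KV-cache estimate for Llama 3.1 8B (assuming 32 layers, 8 KV heads, and head dimension 128): $2 \times 32 \times 8 \times 128 \times 2$ bytes $\approx$ 128KB per token, times 128k tokens $\approx$ 16GB. The sketch below only illustrates the general idea of budget-constrained KV retention described in the abstract; it is not the paper's RaaS algorithm, and the names (BudgetedKVCache, budget, attn_scores) and the staleness heuristic are illustrative assumptions.

```python
import numpy as np

class BudgetedKVCache:
    """Keep at most `budget` (L) cached token KVs during decoding.

    A token's KV is evicted once its attention goes stale, mimicking the
    observation that milestone tokens emerge, get used, then stop mattering.
    (Hypothetical sketch; not the paper's actual method.)
    """

    def __init__(self, budget: int):
        self.budget = budget             # cache budget L, with L << N
        self.keys, self.values = [], []  # per-token key / value tensors
        self.last_used = []              # decode step at which each token last received attention

    def step(self, step_idx, new_k, new_v, attn_scores, threshold=0.01):
        # attn_scores[i]: current step's attention mass on cached token i.
        for i, score in enumerate(attn_scores):
            if score > threshold:        # token i is still being "utilized"
                self.last_used[i] = step_idx
        # Append the newly decoded token's KV.
        self.keys.append(new_k)
        self.values.append(new_v)
        self.last_used.append(step_idx)
        # Enforce the O(L) memory budget: evict the most stale token.
        if len(self.keys) > self.budget:
            evict = int(np.argmin(self.last_used))
            for buf in (self.keys, self.values, self.last_used):
                buf.pop(evict)
```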
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: NLP in resource-constrained settings
Contribution Types: Approaches to low-resource settings, Approaches for low compute settings-efficiency
Languages Studied: English
Submission Number: 397