SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

Published: 11 Feb 2025, Last Modified: 13 May 2025 · MLSys 2025 (with shepherding) · CC BY 4.0
Keywords: Near-lossless Sparse Attention, Long Context LLM Inference
TL;DR: We propose an adaptive structured sparse attention mechanism, SampleAttention, to accelerate the prefill of long-context LLM inference without accuracy loss.
Abstract: Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in prohibitively long Time-to-First-Token (TTFT) latency. Existing sparse attention approaches exploit the high sparsity of attention with either a static sparse pattern or a fixed sparsity ratio, failing to capture the sparsity ratios and sparse patterns that vary across attention heads, input contents, and model architectures. To balance accuracy and performance efficiently, we introduce a robust accuracy indicator, Cumulative Residual Attention (CRA), which measures the percentage of attention score recalled. Leveraging this key insight, we present SampleAttention, which employs a novel two-stage query-guided key-value filtering approach to efficiently and dynamically select a minimal set of important column and slash strips that meets a desired CRA threshold, thus maximizing efficiency while preserving accuracy. Comprehensive evaluations show that SampleAttention establishes a new Pareto frontier in the accuracy-efficiency trade-off and reduces TTFT by up to $5.29\times$ compared with FlashAttention2.
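To make the CRA indicator concrete, below is a minimal PyTorch sketch that, for a single query's softmax attention row, finds the smallest key subset whose cumulative attention weight reaches a desired CRA threshold. The function and parameter names (`cra_select`, `cra_threshold`) are illustrative assumptions, and the paper's actual method selects structured column and slash strips via two-stage query-guided filtering rather than arbitrary per-key subsets as done here.

```python
import torch

def cra_select(attn_row: torch.Tensor, cra_threshold: float = 0.95) -> torch.Tensor:
    """Illustrative sketch (not the paper's algorithm): return a boolean
    mask over keys covering the minimal attention mass >= cra_threshold.

    attn_row: softmax-normalized attention weights for one query, shape (seq_len,).
    """
    # Sort weights in descending order and accumulate their mass.
    sorted_w, order = torch.sort(attn_row, descending=True)
    cum = torch.cumsum(sorted_w, dim=0)
    # Smallest prefix whose cumulative weight reaches the CRA threshold.
    k = int(torch.searchsorted(cum, torch.tensor(cra_threshold)).item()) + 1
    k = min(k, attn_row.numel())
    mask = torch.zeros_like(attn_row, dtype=torch.bool)
    mask[order[:k]] = True
    return mask
```

In this toy formulation, a higher `cra_threshold` recalls more attention mass (better accuracy, less sparsity), which mirrors the accuracy-efficiency trade-off the CRA threshold controls in SampleAttention.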
Submission Number: 11
