Measure Once, Mask Once: Delta Refined Block Sparse Attention

ICLR 2026 Conference Submission 16309 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: sparse attention, large language model, efficient inference
TL;DR: We utilize fast query-sparse attention kernels to generate dynamic key-sparse block attention masks
Abstract: Long context inference poses a problem for large language models (LLMs) due to the high cost of quadratic attention over long input lengths. Efficient long context inference is a necessity for providing low-cost, low-latency LLM serving endpoints. Sparse attention is one way to mitigate the high cost of long context prefills. Many recent state-of-the-art sparse attention methods can be applied on top of pretrained quadratic transformers without any specific finetuning regimen; however, the main obstacle in designing a sparse attention method lies in deciding which parts of the attention computation to perform and which to skip. Previous works generally make this decision using heuristics derived from recurring patterns in the attention matrix, or from pooled block statistics, to select a key-sparse attention mask. We show that these methods capture the total attention score mass suboptimally. In another line of work, key-sparse attention has been shown to induce a distributional shift in attention outputs, which can be mitigated by mixing query-sparse attention with existing key-sparse attention masks and combining the outputs. To save computation, we propose fusing the query-sparse attention and sparse-mask generation processes, resulting in a novel, dynamic, and query-dependent sparse mask generation scheme. Our method calculates a key-sparse block mask while computing query-sparse attention, then uses this dynamic attention mask to perform key-sparse attention before combining the outputs. Our method delivers a 2.5x speedup over Flash Attention 3 at 1M tokens and captures total attention mass within 1.5% of the oracle block top-k attention.
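
The following is a minimal NumPy sketch of the high-level idea described in the abstract, not the authors' fused kernels: attention is computed for a sparse subset of queries, the per-key-block score mass observed in that pass is used to build a dynamic block mask, and block (key-sparse) attention is then run under that mask before the two outputs are combined. The strided query selection, the parameters `block_size`, `query_stride`, and `top_k`, and the simple "overwrite sampled query rows" combination step are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_block_sparse_attention(q, k, v, block_size=64, query_stride=8, top_k=4):
    """q, k, v: [seq_len, head_dim] single-head arrays; seq_len assumed divisible by block_size."""
    n, d = q.shape
    assert n % block_size == 0
    n_blocks = n // block_size
    scale = 1.0 / np.sqrt(d)

    # Pass 1: query-sparse attention on a strided subset of queries.
    q_idx = np.arange(0, n, query_stride)
    s_sparse = softmax(q[q_idx] @ k.T * scale)        # [n_sampled, seq_len]
    out_qsparse = s_sparse @ v                        # exact outputs for sampled queries

    # Accumulate attention mass per (query block, key block) from the sparse pass.
    block_mass = np.zeros((n_blocks, n_blocks))
    for row, qi in enumerate(q_idx):
        qb = qi // block_size
        block_mass[qb] += s_sparse[row].reshape(n_blocks, block_size).sum(axis=1)

    # Dynamic key-sparse block mask: keep the top-k key blocks per query block.
    keep = np.argsort(block_mass, axis=1)[:, -top_k:]

    # Pass 2: key-sparse block attention under the dynamic mask.
    out = np.zeros_like(q)
    for qb in range(n_blocks):
        q_blk = q[qb * block_size:(qb + 1) * block_size]
        k_idx = np.concatenate([np.arange(b * block_size, (b + 1) * block_size)
                                for b in keep[qb]])
        s = softmax(q_blk @ k[k_idx].T * scale)
        out[qb * block_size:(qb + 1) * block_size] = s @ v[k_idx]

    # Combine: reuse the exact outputs for the sampled queries as a stand-in
    # for the paper's output-mixing step.
    out[q_idx] = out_qsparse
    return out
```

The sketch ignores causal masking, batching, and multiple heads, and runs two separate passes rather than the fused mask-generation-during-query-sparse-attention kernel the abstract describes; it is only meant to make the mask-selection and combination steps concrete.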
Supplementary Material: zip
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 16309