Keywords: LLM, Long-context Prefill, Decomposition, Sparse Pattern, Sparse Attention, Dynamic Sparse Attention.
Abstract: Multi-head attention (MHA) and grouped-query attention (GQA) constitute essential architectural components of modern large language models (LLMs). Although attention computation remains relatively inexpensive for small inputs, its cost grows quadratically with input length. In long-context scenarios, such as book-level summarization or code-repository analysis, time-to-first-token (TTFT) can deteriorate significantly. Although various studies have improved prefill-stage performance by exploiting sparsity structure, sparsity can be increased further through refinements of that structure.
In this work, we propose an approximate online decomposition of the attention matrix that dynamically identifies additional sparsity. The attention matrix is decomposed into three components: a slash component, a vertical component, and a horizontal component. Each component requires only linear space, enabling more efficient processing than the full attention matrix. The decomposition is computed from the query and key tokens with a linear-time algorithm. Its statistical properties allow the sparsity mask to be generated by simply selecting elements that exceed a threshold; the threshold can be chosen either to bound the deviation from regular dense attention or to respect a given time budget.
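The decomposition described above can be illustrated with a minimal NumPy sketch. This is a hypothetical reading of the abstract, not the paper's algorithm: here the slash component is taken as the mean over each sub-diagonal of the score matrix, the vertical component as the per-column mean, and the horizontal component as the per-row mean, each requiring only linear space; the paper's actual method computes these directly from queries and keys in linear time.

```python
import numpy as np

def decompose_scores(S):
    """Decompose a causal attention score matrix S (n x n) into three
    linear-space components. Hypothetical sketch: the paper's algorithm
    works from query/key tokens directly, without materializing S."""
    n = S.shape[0]
    horiz = S.mean(axis=1)  # per-row mean, shape (n,)
    vert = S.mean(axis=0)   # per-column mean, shape (n,)
    # slash[d] = mean over the d-th sub-diagonal (entries with i - j = d)
    slash = np.array([np.diagonal(S, offset=-d).mean() for d in range(n)])
    return slash, vert, horiz

def sparse_mask(slash, vert, horiz, threshold):
    """Approximate each score as slash[i-j] + vert[j] + horiz[i] and keep
    causal entries whose approximation exceeds the threshold."""
    n = len(vert)
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    approx = slash[np.clip(i - j, 0, n - 1)] + vert[j] + horiz[i]
    return (approx >= threshold) & (i >= j)  # enforce causality
```

Under this sketch, the mask is produced by a single elementwise comparison, so the threshold directly trades sparsity against fidelity to dense attention.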
We demonstrate that this technique can be applied directly, without retraining, to networks employing standard dense attention (MHA, GQA) with RoPE. We show that accuracy is maintained on the ∞Bench and PG-19 benchmarks for LLAMA-3-8B-INSTRUCT-1048K. Furthermore, we observe substantial increases in sparsity and corresponding speedups over previous methods, halving the FLOP count relative to the state of the art at one million tokens.
Primary Area: generative models
Submission Number: 19739