Keywords: Sparse Attention, Long-context Inference
TL;DR: We present Adamas, a lightweight yet accurate sparse attention mechanism, achieving up to 4.4× self-attention and 1.5× end-to-end speedups on 32K sequences with near-lossless accuracy.
Abstract: Large language models (LLMs) now support context windows of hundreds of thousands to millions of tokens, enabling applications such as long-document summarization, large-scale code synthesis, multi-document question answering and persistent multi-turn dialogue. However, such extended contexts exacerbate the quadratic cost of self-attention, leading to severe latency in autoregressive decoding. Existing sparse attention methods alleviate these costs but rely on heuristic patterns that struggle to recall critical key-value (KV) pairs for each query, resulting in accuracy degradation. We introduce **Adamas**, a lightweight yet highly accurate sparse attention mechanism designed for long-context inference. Adamas applies the Hadamard transform, bucketization and 2-bit compression to produce compact representations, and leverages Manhattan-distance estimation for efficient top-$k$ selection. Experiments show that Adamas matches the accuracy of full attention with only a 64-token budget, achieves near-lossless performance with a 128-token budget, and supports up to $8\times$ higher sparsity than prior state-of-the-art (SOTA) methods while delivering up to $4.4\times$ self-attention and $1.5\times$ end-to-end speedups on 32K-length sequences. Remarkably, Adamas attains comparable or even lower perplexity than full attention, underscoring its effectiveness in maintaining accuracy under aggressive sparsity. Code is publicly available at https://anonymous.4open.science/r/Adamas-36EA.
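The pipeline named in the abstract (Hadamard transform, bucketization to 2-bit codes, Manhattan-distance top-$k$ selection) can be illustrated with a minimal NumPy sketch. This is not the released implementation: the function names, the bucket thresholds, and the orthonormal scaling are assumptions chosen for illustration, and the codes are kept one per byte rather than packed four per byte as a real 2-bit format would be.

```python
import numpy as np

def hadamard_matrix(d):
    """Sylvester construction of a d x d Hadamard matrix (d a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(d)  # orthonormal scaling (an assumption of this sketch)

def bucketize_2bit(x, thresholds):
    """Map each value to one of four buckets, i.e. a code in {0, 1, 2, 3}."""
    return np.digitize(x, thresholds).astype(np.uint8)

def select_top_k(query, keys, k, thresholds=(-0.5, 0.0, 0.5)):
    """Return indices of the k keys whose codes are closest to the query's.

    thresholds are hypothetical values, not taken from the paper.
    """
    H = hadamard_matrix(query.shape[-1])
    q_codes = bucketize_2bit(query @ H, thresholds).astype(np.int16)
    k_codes = bucketize_2bit(keys @ H, thresholds).astype(np.int16)
    # Manhattan (L1) distance between 2-bit codes, one distance per key.
    dist = np.abs(k_codes - q_codes).sum(axis=-1)
    return np.argsort(dist, kind="stable")[:k]

rng = np.random.default_rng(0)
keys = rng.standard_normal((1024, 64))   # 1024 cached keys, head dim 64
query = keys[5]                          # a query identical to key 5
selected = select_top_k(query, keys, k=4)
```

Full attention would then be computed only over the selected key-value pairs, so the cost of scoring scales with the token budget `k` rather than with the full context length.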
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Submission Number: 2783