HashAttention: Semantic Sparsity for Faster Inference

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Poster · License: CC BY-SA 4.0
TL;DR: Sparse attention for faster inference based on learned bit signatures
Abstract: Leveraging long contexts is crucial for advanced AI systems, but attention computation poses a scalability challenge. While scaled dot-product attention (SDPA) exhibits token sparsity, i.e., only a few pivotal tokens contribute significantly to the output, exploiting this sparsity remains challenging. Existing methods either suffer from quality degradation or require substantial additional resources. We show that identifying pivotal tokens is a Maximum Inner Product Search (MIPS) problem. However, existing MIPS solutions are not well-suited for SDPA, as they are not GPU-friendly and often underperform due to the separate distributions of queries and keys. This paper introduces HashAttention, which frames pivotal token identification as a recommendation problem. Given a query, HashAttention encodes keys and queries in Hamming space using learned mapping functions, capturing the required semantic similarity. HashAttention efficiently identifies pivotal tokens for a given query using bitwise operations and computes attention using only these tokens, improving overall attention efficiency. Trained on generic data, HashAttention reduces the tokens used by up to $16\times$ with minimal quality loss, requiring only 32 bits of auxiliary memory per token. Sparsity can be further improved to $32\times$ through task-specific fine-tuning. On an A100 GPU, at $32\times$ sparsity, incorporating HashAttention reduces attention latency by up to $4.3\times$ in GPT-FAST and $2.54\times$ in FlashDecode, and achieves up to $3.12\times$ higher throughput for GPT-FAST.
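To make the mechanism in the abstract concrete, the PyTorch sketch below encodes queries and cached keys into bit signatures with learned maps, scores each cached token by its number of matching bits (Hamming similarity), and runs exact attention only over the top-scoring pivotal tokens. This is a minimal illustration based on the abstract alone: the class name, the single linear mapping functions, and the `num_bits`/`top_k` defaults are assumptions for exposition, not the paper's architecture or hyperparameters; see the linked repository for the actual implementation.

```python
# Minimal sketch of the HashAttention idea (decoding step, single head).
# Assumptions: linear hash maps, 32-bit signatures, fixed top-k budget.
import torch
import torch.nn.functional as F

def sign_bits(x):
    # Map real-valued codes to {0, 1} bit signatures (Hamming space).
    return (x > 0).to(torch.uint8)

class HashAttentionSketch(torch.nn.Module):
    def __init__(self, head_dim, num_bits=32, top_k=128):
        super().__init__()
        # Learned mapping functions for queries and keys (illustrative:
        # single linear layers; the paper's learned mappings may differ).
        self.query_map = torch.nn.Linear(head_dim, num_bits, bias=False)
        self.key_map = torch.nn.Linear(head_dim, num_bits, bias=False)
        self.top_k = top_k

    def forward(self, q, k, v):
        # q: (1, d) decoding query; k, v: (T, d) cached keys/values.
        q_bits = sign_bits(self.query_map(q))      # (1, num_bits)
        k_bits = sign_bits(self.key_map(k))        # (T, num_bits)
        # Hamming similarity = number of matching bits; a fused kernel
        # would compute this as XNOR + popcount over packed signatures.
        matches = (q_bits == k_bits).sum(dim=-1)   # (T,)
        budget = min(self.top_k, k.shape[0])
        idx = torch.topk(matches, budget).indices  # pivotal token indices
        # Exact attention restricted to the selected pivotal tokens.
        scores = (q @ k[idx].T) / (k.shape[-1] ** 0.5)
        return F.softmax(scores, dim=-1) @ v[idx]
```

In practice the 32-bit signatures can be packed into integer words so that the matching-bit count reduces to an XNOR followed by a popcount, which is consistent with the abstract's 32 bits of auxiliary memory per token and its use of bitwise operations for pivotal-token identification.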
Lay Summary: Modern AI systems, such as chatbots, image generators, and code assistants, rely on a mechanism called "attention" to decide which parts of the input are most important, but this process becomes slow and memory-intensive as inputs get longer. We noticed that not every word or token contributes equally; only a few really matter. Our method, HashAttention, finds and focuses only on these important tokens. We discovered this could be done by treating the problem like a recommendation system, similar to how Netflix suggests shows based on your preferences. Using learned functions, we represent the tokens in a compact format that allows fast comparisons using simple bitwise operations. HashAttention speeds up attention without hurting accuracy. It can reduce the number of tokens processed by up to 32× while keeping the output quality nearly the same. This leads to faster, more efficient AI models, helping them handle longer inputs, reason longer, and produce more text with less computing power and lower cost.
Link To Code: https://github.com/xAlg-ai/HashAttention-1.0
Primary Area: Deep Learning->Attention Mechanisms
Keywords: hashing, learning to hash, attention, sparsity, time to next token, decoding, generative LLM
Submission Number: 7234