XAttention: Block Sparse Attention with Antidiagonal Scoring

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We introduce XAttention, a plug-and-play method that uses antidiagonal sums to efficiently identify important parts of the attention matrix, achieving up to 13.5x speedup on long-context tasks with comparable accuracy to full attention.
Abstract: Long-Context Transformer Models (LCTMs) are vital for real-world applications but suffer high computational costs due to attention's quadratic complexity. Block-sparse attention mitigates this by focusing computation on critical regions, yet existing methods struggle to balance accuracy and efficiency due to costly block importance measurements. In this paper, we introduce XAttention, a plug-and-play framework that dramatically accelerates long-context inference in Transformer models using sparse attention. XAttention's key innovation is the insight that the sum of antidiagonal values (i.e., from the lower-left to upper-right) in the attention matrix provides a powerful proxy for block importance. This allows for precise identification and pruning of non-essential blocks, resulting in high sparsity and dramatically accelerated inference. Across comprehensive evaluations on demanding long-context benchmarks—including RULER and LongBench for language, VideoMME for video understanding, and VBench for video generation—XAttention achieves accuracy comparable to full attention while delivering substantial computational gains. We demonstrate up to 13.5x acceleration in attention computation. These results underscore XAttention's ability to unlock the practical potential of block sparse attention, paving the way for scalable and efficient deployment of LCTMs in real-world applications.
Lay Summary: AI models that process long documents or videos, known as Long-Context Transformer Models (LCTMs), are powerful but face a major hurdle: they are computationally expensive, largely due to a component called the attention mechanism. The cost of this mechanism grows quadratically with the length of the information, creating a significant bottleneck. To solve this, researchers have developed "block-sparse attention," an approach that saves time by having the AI focus only on the most critical blocks of information instead of every single detail. However, existing methods have struggled because the process of figuring out which blocks are important can itself be slow and inefficient, canceling out the benefits. Our paper introduces XAttention, a new framework that makes these models much faster without sacrificing accuracy. The key discovery is that summing up values along antidiagonals (lines running from the lower-left to the upper-right) within the model's attention grid is a surprisingly simple and effective way to measure a block's importance. This "antidiagonal scoring" method allows XAttention to quickly identify and prune away non-essential computations. Evaluated on demanding tasks involving long-form language, video analysis, and video generation, XAttention demonstrated performance comparable to the original, slower models while providing significant speed-ups. For instance, it achieved up to a 13.5x acceleration in the core attention computation. These results show that XAttention makes powerful AI for long-form content more practical and efficient for real-world applications.
Link To Code: https://github.com/mit-han-lab/x-attention
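The sketch below illustrates the antidiagonal-scoring idea from the abstract in PyTorch. It is not the official implementation (see the repository linked above): for clarity it materializes the full score matrix, scores each block by its main antidiagonal only, and keeps blocks with a simple top-k rule, whereas the paper's method avoids computing the full attention matrix and uses its own strided sampling and selection criteria. The function names, block size, and `keep_ratio` parameter are illustrative assumptions.

```python
# Toy sketch of antidiagonal block scoring for block-sparse attention.
# NOT the official XAttention code; simplified assumptions are noted inline.
import torch


def antidiagonal_block_scores(q: torch.Tensor, k: torch.Tensor, block: int = 64) -> torch.Tensor:
    """Estimate the importance of each (query block, key block) pair by summing
    the antidiagonal (lower-left to upper-right) entries of that block of Q·K^T.
    For clarity this toy version computes the full score matrix; the paper's
    method only evaluates the sampled entries."""
    n, d = q.shape
    scores = (q @ k.T) / d**0.5                     # full scores, illustration only
    nb = n // block                                 # assume n is a multiple of block
    blocks = scores[: nb * block, : nb * block]
    blocks = blocks.reshape(nb, block, nb, block).permute(0, 2, 1, 3)  # (nb, nb, B, B)
    # Main antidiagonal of each B x B block: entries (0, B-1), (1, B-2), ..., (B-1, 0).
    antidiag = torch.flip(blocks, dims=[-1]).diagonal(dim1=-2, dim2=-1)
    return antidiag.sum(dim=-1)                     # (num_q_blocks, num_k_blocks)


def select_blocks(block_scores: torch.Tensor, keep_ratio: float = 0.1) -> torch.Tensor:
    """Keep the highest-scoring key blocks for each query block.
    A fixed top-k ratio is an assumption made here for brevity."""
    k_keep = max(1, int(keep_ratio * block_scores.shape[-1]))
    idx = block_scores.topk(k_keep, dim=-1).indices
    mask = torch.zeros_like(block_scores, dtype=torch.bool)
    return mask.scatter(-1, idx, True)              # True = compute this block


if __name__ == "__main__":
    q = torch.randn(1024, 128)
    k = torch.randn(1024, 128)
    mask = select_blocks(antidiagonal_block_scores(q, k, block=64), keep_ratio=0.1)
    print(mask.shape, mask.float().mean().item())   # block mask and achieved density
```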
Primary Area: Deep Learning->Large Language Models
Keywords: Large Language Models, Efficiency
Submission Number: 3943