Keywords: Trainable Sparse Attention, Dynamic Mask Attention
Abstract: Self-attention's computational cost, which scales quadratically with sequence length, creates a fundamental bottleneck for long-context modeling in LLMs, limiting applications such as document understanding, multi-turn reasoning, and code generation. Sparse attention has been proposed to mitigate this issue. Early content-agnostic designs such as sliding-window and block-sparse attention reduce computational complexity based on fixed patterns. However, their static structure often overlooks important long-range dependencies and lacks adaptivity to diverse query contexts. Recent content-aware methods improve adaptivity by conditioning attention sparsity on token representations, but they typically rely on hard binary masks or heuristic key-value selection, introducing runtime overhead and hindering full differentiability. We propose Dynamic Mask Attention (DMA), a trainable content-aware sparse attention mechanism with head-wise specialization. DMA dynamically generates content-driven masks with continuous importance weights based on value representations, enabling both expressiveness and full differentiability. We theoretically prove that masked entries are mathematically equivalent to zero in both the forward and backward passes, thereby ensuring unbiased gradients. Furthermore, we develop efficient CUDA kernels with block-skipping for practical acceleration. Extensive experiments demonstrate that DMA consistently outperforms state-of-the-art sparse attention baselines across pre-training and downstream tasks, reducing perplexity, improving accuracy, and delivering substantial long-sequence speedups of up to 10×.
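The core idea in the abstract, deriving a content-driven mask from value representations and applying it so that masked entries are exactly zero in both the forward and backward passes, can be sketched in NumPy. This is a minimal illustrative sketch, not the paper's implementation: the value-based importance heuristic (`np.abs(v).mean`), the `keep_ratio` parameter, and the additive log-bias formulation are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_mask_attention(q, k, v, keep_ratio=0.5):
    """Illustrative sketch of a content-aware dynamic mask.

    Keys are scored by an importance measure derived from their value
    representations (heuristic assumed here); only the top fraction is
    retained. Retained keys receive a continuous importance weight as an
    additive log-bias, while masked entries are set to -inf before the
    softmax, making them exactly zero in the forward pass and giving
    them zero gradient in the backward pass.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)               # (Tq, Tk) attention logits
    importance = np.abs(v).mean(axis=-1)        # (Tk,) value-driven importance (assumed)
    n_keep = max(1, int(keep_ratio * importance.shape[0]))
    kept = np.argsort(importance)[-n_keep:]     # indices of retained keys
    mask = np.full_like(scores, -np.inf)
    mask[:, kept] = np.log(importance[kept] + 1e-9)  # continuous weight as log-bias
    weights = softmax(scores + mask, axis=-1)   # masked columns are exactly 0
    return weights @ v

rng = np.random.default_rng(0)
T, d = 8, 4
q, k, v = (rng.standard_normal((T, d)) for _ in range(3))
out = dynamic_mask_attention(q, k, v, keep_ratio=0.5)
print(out.shape)
```

Because the masked logits are `-inf` rather than merely small, the corresponding softmax weights are identically zero, which is what makes the gradient through those entries vanish exactly, the property the paper formalizes.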
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 4299