AdaSplash: Adaptive Sparse Flash Attention

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 (oral) · CC BY 4.0
TL;DR: An efficient flash attention implementation for adaptive sparsity.
Abstract: The computational cost of softmax-based attention in transformers limits their applicability to long-context tasks. Adaptive sparsity, of which $\alpha$-entmax attention is an example, offers a flexible data-dependent alternative, but existing implementations are inefficient and do not leverage the sparsity to obtain runtime and memory gains. In this work, we propose AdaSplash, which combines the efficiency of GPU-optimized algorithms with the sparsity benefits of $\alpha$-entmax. We first introduce a hybrid Halley-bisection algorithm, resulting in a 7-fold reduction in the number of iterations needed to compute the $\alpha$-entmax transformation. Then, we implement custom Triton kernels to efficiently handle adaptive sparsity. Experiments with RoBERTa and ModernBERT for text classification and single-vector retrieval, along with GPT-2 for language modeling, show that our method achieves substantial improvements in runtime and memory efficiency compared to existing $\alpha$-entmax implementations. It approaches---and in some cases surpasses---the efficiency of highly optimized softmax implementations like FlashAttention-2, enabling long-context training while maintaining strong task performance.
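For intuition on the root-finding idea described in the abstract, below is a minimal NumPy sketch of a hybrid Halley-bisection solver for the $\alpha$-entmax threshold $\tau$ (the value at which $\sum_i [(\alpha-1)z_i - \tau]_+^{1/(\alpha-1)} = 1$). This is an illustrative assumption-laden sketch, not the paper's implementation: the actual AdaSplash kernels are written in Triton and exploit block sparsity, and all function and variable names here are made up for exposition.

```python
import numpy as np

def entmax_threshold(z, alpha=1.5, n_iter=10, tol=1e-6):
    """Illustrative hybrid Halley-bisection solver for the alpha-entmax
    threshold tau, where p_i = max((alpha-1)*z_i - tau, 0)**(1/(alpha-1))
    and tau is chosen so that the p_i sum to 1. Not the AdaSplash kernel."""
    s = (alpha - 1.0) * np.asarray(z, dtype=np.float64)
    lo, hi = s.max() - 1.0, s.max()      # bracket: f(lo) >= 0 >= f(hi)
    tau = 0.5 * (lo + hi)
    exp = 1.0 / (alpha - 1.0)

    for _ in range(n_iter):
        u = np.maximum(s - tau, 0.0)
        active = u > 0
        f = np.sum(u[active] ** exp) - 1.0           # root-finding objective
        if abs(f) < tol:
            break
        # f is decreasing in tau, so update the bracket accordingly
        if f > 0:
            lo = tau
        else:
            hi = tau
        fp = -exp * np.sum(u[active] ** (exp - 1.0))                # f'(tau)
        fpp = exp * (exp - 1.0) * np.sum(u[active] ** (exp - 2.0))  # f''(tau)
        denom = 2.0 * fp * fp - f * fpp
        # Halley step; fall back to bisection if it is undefined or
        # leaves the bracketing interval
        tau_halley = tau - 2.0 * f * fp / denom if denom != 0.0 else tau
        tau = tau_halley if lo < tau_halley < hi else 0.5 * (lo + hi)

    p = np.maximum(s - tau, 0.0) ** exp
    return tau, p

# Example: low-scoring tokens receive exactly zero attention weight
z = np.array([3.0, 2.5, 1.0, -1.0])
tau, p = entmax_threshold(z, alpha=1.5)
print(p, p.sum())   # two entries are exactly 0; the weights sum to ~1
```

The cubic convergence of Halley's method plus the bisection safeguard is what allows the threshold to be found in very few iterations, which is the source of the iteration-count reduction claimed in the abstract.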
Lay Summary: Transformers, the backbone of modern language models, use softmax attention to decide how much focus each token (e.g., a word) gives to others. While effective, softmax always assigns some importance to every token, even irrelevant ones, making it harder for models to focus sharply on the important ones. A promising alternative is adaptively sparse $\alpha$-entmax attention, which learns to ignore irrelevant tokens by assigning them exactly zero weight, allowing models to focus more selectively. However, prior implementations of $\alpha$-entmax have been too slow and memory-intensive for practical use. To address this, we introduce AdaSplash, a fast and GPU-friendly implementation of $\alpha$-entmax attention. AdaSplash significantly reduces both computation time and memory usage compared to previous methods, closing the longstanding gap between the theoretical appeal of $\alpha$-entmax attention and its practical usability in large-scale, long-context applications. All code is open-source to support broader adoption and further research.
Link To Code: https://github.com/deep-spin/adasplash
Primary Area: Deep Learning->Attention Mechanisms
Keywords: Sparse Attention, Flash Attention, Adaptive Sparsity, Long Context Transformers
Submission Number: 12577