Linear Attention Optimized GPU Kernel Implementation

TMLR Paper5390 Authors

15 Jul 2025 (modified: 23 Oct 2025). Rejected by TMLR. License: CC BY 4.0

Abstract: The original softmax-based attention mechanism (regular attention) in the extremely successful Transformer architecture computes attention between $N$ tokens, each embedded in a $D$-dimensional head, with a time complexity of $O(N^2D)$. Given the success of Transformers, improving their runtime during both training and inference is a popular research area. One such approach is linear attention (LA), which has a time complexity of $O(ND^2)$, linear in $N$, and has demonstrated accuracy comparable to regular attention. In practice, however, LA lags behind its theoretical efficiency. We propose a novel method for LA's forward and backward passes, along with a highly optimized CUDA implementation. Our approach is 3.3× faster than the state of the art and reduces memory consumption by 3.6×. We validate these improvements in both single-layer and end-to-end settings by training a 1.4-billion-parameter language model, which demonstrates expressivity similar to regular attention on major reasoning benchmarks.
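
The gap between $O(N^2D)$ and $O(ND^2)$ comes from the order in which the matrix products are evaluated. The sketch below is a minimal, non-causal illustration in plain Python, not the submission's CUDA kernel. It assumes a kernelized form of attention with the feature map $\phi(x) = \mathrm{elu}(x) + 1$, which is a common choice in the linear attention literature and is not stated in the abstract. The function names `quadratic_form` and `linear_form` are hypothetical. Both compute the same output; the first materializes an $N \times N$ matrix, the second only a $D \times D$ one.

```python
import math
import random


def phi(x):
    # Assumed feature map elu(x) + 1 (always positive); the paper may use another.
    return x + 1.0 if x > 0 else math.exp(x)


def transpose(A):
    return [list(row) for row in zip(*A)]


def matmul(A, B):
    # Plain triple-loop matrix product on lists of lists.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def featurize(M):
    return [[phi(x) for x in row] for row in M]


def quadratic_form(Q, K, V):
    # (phi(Q) phi(K)^T) V with row normalization: builds an N x N matrix, O(N^2 D).
    Qf, Kf = featurize(Q), featurize(K)
    A = matmul(Qf, transpose(Kf))                    # N x N
    A = [[a / sum(row) for a in row] for row in A]   # normalize each row
    return matmul(A, V)                              # N x D


def linear_form(Q, K, V):
    # phi(Q) (phi(K)^T V): builds only a D x D matrix, O(N D^2).
    Qf, Kf = featurize(Q), featurize(K)
    KV = matmul(transpose(Kf), V)                    # D x D summary of keys and values
    z = [sum(col) for col in zip(*Kf)]               # length-D sum of key features
    out = []
    for q in Qf:
        denom = sum(a * b for a, b in zip(q, z))     # same value as the row sum above
        num = matmul([q], KV)[0]
        out.append([x / denom for x in num])
    return out


# Small demo on random inputs: the two evaluation orders agree up to rounding.
rng = random.Random(0)
N, D = 8, 3
Q = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(N)]
K = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(N)]
V = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(N)]
max_diff = max(
    abs(x - y)
    for ra, rb in zip(quadratic_form(Q, K, V), linear_form(Q, K, V))
    for x, y in zip(ra, rb)
)
```

Because matrix multiplication is associative, `max_diff` should be at the level of floating-point rounding. The practical difficulty the abstract points to is that a naive evaluation like this one does not by itself reach the theoretical speedup on a GPU.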

Submission Length: Regular submission (no more than 12 pages of main content)

Changes Since Last Submission: To address the requested changes: 1. We have added Appendix B to provide a more in-depth view of previous work. 2. We have added a discussion of the benefits of linear attention in the last paragraph of Section 1.

Assigned Action Editor: ~Ali_Ramezani-Kebrya1

Submission Number: 5390
