Accelerating Linear Attention Design by Unifying Forward & Backward Propagation

Published: 11 Jun 2025, Last Modified: 10 Jul 2025
Venue: ES-FoMo III
License: CC BY 4.0
Keywords: Linear Attention; Kernel Design
TL;DR: Speed up linear attention kernel design with a unified formulation.
Abstract: The rapid evolution of linear attention has led to a proliferation of novel methods in recent years. However, designing and implementing a linear attention mechanism remains inherently complex and cumbersome. Typically, the process is composed of four essential stages: 1) Formulating the forward recursive expression; 2) Implementing the chunk-parallel approach for forward propagation; 3) Deriving the recursive formulation for backpropagation; and 4) Implementing the chunk-parallel backpropagation and finalizing the kernel implementation. This multi-stage design pipeline is a significant impediment to the efficient development and exploration of new linear attention variants. In this paper, we demonstrate that both forward and backward propagation in linear attention can be expressed through a unified functional framework: by manipulating its input parameters, the same function yields either the forward or the backward propagation results. This approach substantially reduces the development effort associated with linear attention kernel implementation. We validate our method across multiple linear attention variants, including constant decay, scalar decay, and vector decay, on language modeling tasks. Despite the reduction in development effort, experimental results demonstrate that the kernels implemented using our approach outperform the original implementations in both speed and memory efficiency.
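The unified view can be illustrated with a minimal O(n²) reference, not the authors' chunk-parallel kernel. In this sketch the function name `decayed_attend`, the constant scalar decay `lam`, and the quadratic masked-matmul realization are all illustrative assumptions; the point is only that the forward output and all three input gradients come from one function with its arguments permuted and, for dK and dV, the time direction reversed.

```python
# Minimal sketch (assumed names, not the paper's kernel): one decayed-attention
# function serves as both the forward pass and every gradient computation.
import torch

def decayed_attend(q, k, v, lam, reverse=False):
    """o_t = sum_s lam^{|t-s|} (q_t . k_s) v_s, over s <= t (causal) or s >= t (reverse)."""
    n = q.shape[0]
    t = torch.arange(n)
    delta = t[:, None] - t[None, :]                    # t - s
    if reverse:
        delta = -delta                                 # s - t (anti-causal direction)
    mask = (delta >= 0).float()                        # keep only the valid triangle
    decay = lam ** delta.clamp(min=0).float()          # lam^{|t-s|} on that triangle
    a = (q @ k.T) * decay * mask                       # decayed, masked similarity scores
    return a @ v

# Forward and all three gradients via the same function, checked against autograd.
n, d_k, d_v, lam = 8, 4, 4, 0.9
q = torch.randn(n, d_k, requires_grad=True)
k = torch.randn(n, d_k, requires_grad=True)
v = torch.randn(n, d_v, requires_grad=True)

o = decayed_attend(q, k, v, lam)                       # forward (causal)
do = torch.randn_like(o)                               # incoming output gradient

dq = decayed_attend(do, v, k, lam)                     # dQ: causal, arguments permuted
dk = decayed_attend(v, do, q, lam, reverse=True)       # dK: anti-causal
dv = decayed_attend(k, q, do, lam, reverse=True)       # dV: anti-causal

o.backward(do)
assert torch.allclose(dq, q.grad, atol=1e-4)
assert torch.allclose(dk, k.grad, atol=1e-4)
assert torch.allclose(dv, v.grad, atol=1e-4)
```

Under this assumption, stages 3 and 4 of the pipeline collapse into reusing stage 2: a single chunk-parallel implementation of the unified function would serve both forward and backward propagation.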
Submission Number: 91