LeanAttention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers

Published: 11 Feb 2025 · Last Modified: 13 May 2025 · MLSys 2025 · CC BY 4.0
Keywords: Attention, GPU, Hardware-aware
TL;DR: LeanAttention, a scalable, hardware-efficient, “exact” attention acceleration mechanism for the decode-phase of transformer-based models.
Abstract: Transformer-based large language models are memory-hungry and incur significant inference latencies even on cutting-edge AI accelerators such as GPUs. Specifically, the time and memory complexity of the attention operation is quadratic in the total context length, i.e., prompt and output tokens. To address this, we propose LeanAttention, a scalable, hardware-efficient, “exact” attention acceleration mechanism for the decode phase of transformer-based models. LeanAttention scales the attention mechanism to the challenging case of long context lengths by redesigning the attention execution flow for the decode phase. As a result, we achieve an average 1.73x speedup in attention execution over FlashDecoding, with up to 2.18x speedup at a 256k context length.
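To make the decode-phase setting concrete, below is a minimal NumPy sketch of single-query (decode-time) attention in which the key/value context is split into chunks, each chunk produces an un-normalized partial output plus softmax statistics, and the partials are rescaled and merged at the end. This is only an illustration of the general split-context execution style that decode-phase kernels such as FlashDecoding use; the chunking scheme, function names, and parameters here are assumptions for exposition, not LeanAttention's actual partitioning or reduction design.

```python
import numpy as np

def decode_attention_split(q, K, V, num_splits=4):
    """Single-query decode attention with the context split into chunks.

    Each chunk yields (chunk max, sum of exponentials, partial output);
    the partials are rescaled to a common max and reduced at the end.
    Illustrative sketch only; not the LeanAttention algorithm itself.
    """
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    chunks = np.array_split(np.arange(K.shape[0]), num_splits)

    partials = []  # (max_score, sum_exp, partial_output) per chunk
    for idx in chunks:
        s = (K[idx] @ q) * scale           # attention scores for this chunk
        m = s.max()                        # chunk-local max for numerical stability
        p = np.exp(s - m)                  # un-normalized probabilities
        partials.append((m, p.sum(), p @ V[idx]))

    # Reduction: rescale every partial result to the global max before summing.
    m_global = max(m for m, _, _ in partials)
    l_total = sum(l * np.exp(m - m_global) for m, l, _ in partials)
    o_total = sum(o * np.exp(m - m_global) for m, _, o in partials)
    return o_total / l_total

if __name__ == "__main__":
    # Check against a monolithic softmax(q K^T / sqrt(d)) V computation.
    rng = np.random.default_rng(0)
    d, n = 64, 1024
    q, K, V = rng.normal(size=(d,)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
    s = (K @ q) / np.sqrt(d)
    ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
    assert np.allclose(decode_attention_split(q, K, V), ref, atol=1e-6)
```

Because each chunk carries its own max and normalizer, the final result is exact (identical to computing softmax over the full context at once), which is the property the paper's “exact” qualifier refers to.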
Supplementary Material: pdf
Submission Number: 283