Keywords: Attention, GPU, Hardware-aware
TL;DR: LeanAttention is a scalable, hardware-efficient, “exact” attention acceleration mechanism for the decode phase of transformer-based models.
Abstract: Transformer-based large language models are memory-hungry and incur significant inference latencies even on cutting-edge AI accelerators, such as GPUs. Specifically, the time and memory complexity of the attention operation is quadratic in the total context length, i.e., the number of prompt and output tokens. To address this, we propose LeanAttention, a scalable, hardware-efficient, “exact” attention acceleration mechanism for the decode phase of transformer-based models. LeanAttention enables scaling the attention mechanism to the challenging case of long context lengths by redesigning the attention execution flow for the decode phase. As a result, we achieve an average 1.73x speedup in attention execution compared to FlashDecoding, with up to 2.18x speedup at a 256k context length.
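The abstract does not spell out the redesigned execution flow, but the general pattern behind FlashDecoding-style decode attention is to split the KV cache along the context length, compute unnormalized partial outputs with per-split softmax statistics, and then combine them exactly in a final rescaling reduction. The sketch below illustrates only that general split-and-reduce pattern, not the authors' kernel; it assumes NumPy, a single query head, and illustrative names (decode_attention_split_kv, num_splits).

    import numpy as np

    def decode_attention_split_kv(q, K, V, num_splits=4):
        """One decode step: a single query row attends over the full KV cache.
        The cache is partitioned along the context length; each partition yields
        an unnormalized partial output plus its softmax statistics, and a final
        reduction rescales and combines the partials into the exact result."""
        d = q.shape[-1]
        scale = 1.0 / np.sqrt(d)
        chunks = np.array_split(np.arange(K.shape[0]), num_splits)

        partial_out, partial_max, partial_sum = [], [], []
        for idx in chunks:
            s = (K[idx] @ q) * scale       # partial scores for this KV chunk
            m = s.max()                    # chunk-local max for numerical stability
            p = np.exp(s - m)              # unnormalized probabilities
            partial_out.append(p @ V[idx]) # unnormalized partial output
            partial_max.append(m)
            partial_sum.append(p.sum())

        # Exact reduction: rescale each partial by exp(local_max - global_max)
        g = max(partial_max)
        out = np.zeros(d)
        denom = 0.0
        for o, m, s in zip(partial_out, partial_max, partial_sum):
            w = np.exp(m - g)
            out += w * o
            denom += w * s
        return out / denom

    # Check against a naive reference implementation of decode attention
    rng = np.random.default_rng(0)
    q = rng.standard_normal(64)
    K = rng.standard_normal((1024, 64))
    V = rng.standard_normal((1024, 64))
    s = (K @ q) / np.sqrt(64)
    ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
    assert np.allclose(decode_attention_split_kv(q, K, V), ref)

Because the rescaling reduction reproduces the exact softmax normalization, this kind of split changes only how work is scheduled across the hardware, not the attention output itself.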
Supplementary Material: pdf
Submission Number: 283