SparQ Attention: Bandwidth-Efficient LLM Inference

ICLR 2024 Workshop ME-FoMo Submission 56 Authors

Published: 04 Mar 2024, Last Modified: 02 May 2024 · ME-FoMo 2024 Poster · CC BY 4.0
Keywords: language models, sparse attention, sparsity, efficient inference, transformer
TL;DR: An attention sparsity technique for improving the throughput of pre-trained LLM inference.
Abstract: Through an analysis of the statistical properties of pre-trained large language models (LLMs), we highlight two opportunities for sparse memory access: first, in the components of the query and key vectors, and second, in the attention scores corresponding to key-value pairs. Based on this, we introduce **SparQ Attention**, a technique for increasing the inference throughput of LLMs by utilising memory bandwidth more efficiently within attention layers, through selective fetching of the cached history. Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show that SparQ Attention brings up to $8\times$ savings in attention data transfers without substantial drops in accuracy, by evaluating Llama $2$, Mistral and Pythia models on a wide range of downstream tasks.
Submission Number: 56
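
The two-step idea described in the abstract (approximate the attention scores from a few query/key components, then fetch only the highest-scoring key-value pairs from the cache) can be illustrated with a minimal sketch. This is not the authors' exact algorithm: the function name `sparq_attention_sketch`, the choice of PyTorch, the score scaling, and the parameters `r` and `k` are illustrative assumptions, and refinements from the full method are omitted.

```python
import torch

def sparq_attention_sketch(q, K, V, r=16, k=64):
    """Illustrative two-step sparse attention for one query vector.

    q: (d,) current query; K, V: (seq, d) cached keys and values.
    Step 1 reads only r of the d key components to approximate scores;
    step 2 fetches only the k highest-scoring key-value pairs.
    """
    d = q.shape[-1]

    # Step 1: approximate attention scores using the r largest-magnitude
    # components of the query (and the matching columns of K).
    _, comp_idx = torch.topk(q.abs(), r)
    s_hat = (q[comp_idx] @ K[:, comp_idx].T) / d**0.5  # (seq,)

    # Step 2: fetch only the top-k key/value pairs under the approximate
    # scores and run exact attention over that subset of the cache.
    _, pos_idx = torch.topk(s_hat, k)
    K_top, V_top = K[pos_idx], V[pos_idx]
    s = torch.softmax((q @ K_top.T) / d**0.5, dim=-1)  # (k,)
    return s @ V_top  # (d,)


# Toy usage: a head dimension of 128 and 1024 cached positions.
torch.manual_seed(0)
q, K, V = torch.randn(128), torch.randn(1024, 128), torch.randn(1024, 128)
print(sparq_attention_sketch(q, K, V).shape)  # torch.Size([128])
```

In this sketch, only `r` columns of the key cache plus `k` rows of the keys and values are read, rather than the full cached history, which is where the kind of attention data-transfer saving claimed in the abstract would come from.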