Keywords: Approximate Attention, Low-Dimensional Attention Keys, PCA, Top-K Attention
Abstract: Inference on large language models (LLMs) can be expensive in terms of compute and memory costs, especially at long sequence lengths. In particular, the self-attention mechanism used in LLM inference contributes significantly to these costs, which has sparked interest in approximating the self-attention computation to reduce them. In this work, we propose to approximate self-attention by focusing on the dimensionality of the key vectors computed in the attention block. Our analysis reveals that key vectors lie in a significantly lower-dimensional space, consistently across several datasets and models. Exploiting this observation, we propose Loki, a novel sparse attention method that ranks and selects tokens in the KV-cache based on attention scores computed in this low-dimensional space (an illustrative sketch appears below). Our evaluations show that Loki speeds up the attention computation through reduced data movement (load/store) and compute costs, while preserving model efficacy better than other popular approximation methods.
Supplementary Material: zip
Primary Area: Infrastructure (libraries, improved implementation and scalability, distributed solutions)
Submission Number: 14363
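
Below is a minimal PyTorch sketch of the idea described in the abstract, not the authors' implementation: approximate attention scores are computed in a low-dimensional subspace of the keys, the top-k cached tokens are selected, and exact attention is evaluated only over that subset. The PCA basis `pca_basis`, the reduced dimension `r`, and the token budget `k` are illustrative assumptions.

```python
# Sketch of low-dimensional top-k attention (assumptions noted above; not the
# paper's official code). Single query vector, single head, for clarity.
import math
import torch


def loki_style_topk_attention(q, K, V, pca_basis, k):
    """Approximate single-query attention via top-k selection in a PCA subspace.

    q:         (d,)    query vector
    K, V:      (n, d)  cached key/value vectors
    pca_basis: (d, r)  top-r principal directions of the keys (r << d)
    k:         int     number of cached tokens kept for exact attention
    """
    d = q.shape[-1]

    # 1. Project query and keys into the low-dimensional key subspace.
    q_low = q @ pca_basis            # (r,)
    K_low = K @ pca_basis            # (n, r)

    # 2. Approximate attention scores using only the reduced dimensions.
    approx_scores = K_low @ q_low / math.sqrt(d)   # (n,)

    # 3. Keep the k highest-scoring tokens from the KV-cache.
    top_idx = torch.topk(approx_scores, min(k, K.shape[0])).indices

    # 4. Exact attention restricted to the selected tokens.
    exact_scores = K[top_idx] @ q / math.sqrt(d)
    weights = torch.softmax(exact_scores, dim=-1)
    return weights @ V[top_idx]                    # (d,)


if __name__ == "__main__":
    torch.manual_seed(0)
    n, d, r, k = 1024, 128, 32, 64
    K, V, q = torch.randn(n, d), torch.randn(n, d), torch.randn(d)
    # Illustrative PCA basis derived from the cached keys themselves.
    _, _, Vh = torch.linalg.svd(K - K.mean(0), full_matrices=False)
    pca_basis = Vh[:r].T             # (d, r)
    out = loki_style_topk_attention(q, K, V, pca_basis, k)
    print(out.shape)                 # torch.Size([128])
```

In a real deployment, the PCA basis would presumably be computed offline from calibration data and reused across queries, so only the low-rank projections and the top-k gather add overhead at inference time.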