Approximate Top-k for Increased Parallelism

Oscar Key; Luka Ribar; Alberto Cattaneo; Luke Hudlass-Galley; Douglas Orr

Published: 10 Oct 2024, Last Modified: 05 Dec 2024, Venue: AFM 2024 Poster, License: CC BY 4.0

Keywords: approximate nearest neighbor search, long context, efficient inference

Abstract: The nearest neighbour search problem underlies many important machine learning applications, including efficient long-context generation, retrieval-augmented generation, and knowledge graph completion. However, computing top-k exactly suffers from limited parallelism, making it inefficient for highly parallel machine learning accelerators. By relaxing the requirement that the top-k is exact, bucketed algorithms can dramatically increase parallelism by independently computing many smaller top-k operations. We explore the design choices for this class of algorithms using both theoretical analysis and empirical evaluation on downstream tasks. Our motivating examples are sparsity algorithms for language models, which often use top-k to select the most important parameters or activations. We also release a fast bucketed top-k implementation for PyTorch.
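To make the bucketed idea in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' released PyTorch implementation. The function names `bucketed_topk` and `recall`, the interleaved bucket layout, and the requirement that k divides evenly across buckets are all assumptions made for illustration. Each bucket is processed independently, which is the property that would let an accelerator run the smaller top-k operations in parallel.

```python
import heapq


def bucketed_topk(values, k, num_buckets):
    """Approximate top-k indices (hypothetical helper, illustration only).

    Splits the input into `num_buckets` interleaved buckets and keeps the
    top k // num_buckets entries of each bucket.
    """
    if k % num_buckets != 0:
        raise ValueError("this sketch assumes k is divisible by num_buckets")
    per_bucket = k // num_buckets
    if per_bucket * num_buckets > len(values) - (len(values) % num_buckets):
        raise ValueError("buckets are too small for the requested k")

    selected = []
    for b in range(num_buckets):
        # Bucket b holds indices b, b + num_buckets, b + 2 * num_buckets, ...
        bucket_indices = range(b, len(values), num_buckets)
        # No bucket depends on another, so these small top-k calls
        # could all run at the same time on parallel hardware.
        selected.extend(
            heapq.nlargest(per_bucket, bucket_indices, key=values.__getitem__)
        )
    # Return the selected indices ordered by descending value.
    return sorted(selected, key=values.__getitem__, reverse=True)


def recall(approx_indices, values, k):
    """Fraction of the exact top-k indices that the approximation recovered."""
    exact = set(heapq.nlargest(k, range(len(values)), key=values.__getitem__))
    return len(exact & set(approx_indices)) / k


if __name__ == "__main__":
    scores = [9, 1, 8, 2, 7, 3, 6, 4]
    approx = bucketed_topk(scores, k=4, num_buckets=2)
    print(approx, recall(approx, scores, k=4))
```

With a single bucket the sketch reduces to exact top-k. With more buckets it can miss true top-k elements that happen to fall in the same bucket, which is the accuracy cost paid for the added parallelism.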

Submission Number: 74