Hydragen: High-Throughput LLM Inference with Shared Prefixes

Published: 21 Jun 2024 · Last Modified: 24 Jul 2024 · ES-FoMo-II 2024 Poster · CC BY 4.0
Keywords: Large language models, high throughput, hardware-awareness
TL;DR: We present an efficient implementation of attention specialized for batches of sequences that share a prefix.
Abstract: As large language models (LLMs) are deployed more broadly, reducing the cost of inference has become increasingly important. A common inference use case involves a batch of sequences that share a prefix, such as when reusing few-shot examples or sampling many completions from a single prompt. In a large-batch setting, transformer decoding can be bottlenecked by the attention operation, which reads large key-value (KV) caches from memory and computes inefficient matrix-vector products for every sequence in the batch. In this work, we introduce Hydragen, a hardware-aware exact implementation of attention specialized for shared prefixes. Hydragen computes attention separately over the shared prefix and unique suffixes. This decomposition enables efficient prefix attention by batching queries together across sequences, reducing redundant memory reads and replacing matrix-vector products with hardware-friendly matrix-matrix products. In a high-throughput setting (batch size 1K, tensor parallelism across eight A100s), our method can improve end-to-end CodeLlama-13b throughput by over 3x with a prefix length of 1K, and by over 30x with a prefix length of 16K. Hydragen's efficient processing of long shared contexts leads to only a 15% drop in throughput as the sequence length grows by 16x. We extend Hydragen beyond simple prefix-suffix decomposition and apply it to hierarchical sharing patterns, which allows us to further reduce inference time on competitive programming problems by 55%.
Submission Number: 71
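
To make the decomposition described in the abstract concrete, below is a minimal illustrative sketch (not the authors' released implementation) of attention split between a shared prefix and per-sequence suffixes, recombined exactly via the softmax log-sum-exps of the two parts. The tensor names, shapes, and the `scale` argument are assumptions chosen for exposition.

```python
# Illustrative sketch of Hydragen-style attention decomposition.
# Shapes: q [b, d]; prefix_k/prefix_v [n_prefix, d] (shared across the batch);
# suffix_k/suffix_v [b, n_suffix, d] (unique per sequence).
import torch

def attn_with_lse(q, k, v, scale):
    # Per-sequence attention that also returns the log softmax denominator.
    scores = torch.einsum("bd,bnd->bn", q, k) * scale
    lse = torch.logsumexp(scores, dim=-1)
    out = torch.einsum("bn,bnd->bd", torch.softmax(scores, dim=-1), v)
    return out, lse

def hydragen_like_attention(q, prefix_k, prefix_v, suffix_k, suffix_v, scale):
    # Prefix: the KV cache is shared, so all b queries attend to it in one
    # matrix-matrix product instead of b separate matrix-vector products.
    prefix_scores = q @ prefix_k.T * scale                      # [b, n_prefix]
    prefix_lse = torch.logsumexp(prefix_scores, dim=-1)
    prefix_out = torch.softmax(prefix_scores, dim=-1) @ prefix_v  # [b, d]

    # Suffix: ordinary per-sequence attention over each unique KV cache.
    suffix_out, suffix_lse = attn_with_lse(q, suffix_k, suffix_v, scale)

    # Exact recombination: weight each partial result by its share of the
    # total softmax mass, recovered from the two log-sum-exps.
    w = torch.softmax(torch.stack([prefix_lse, suffix_lse], dim=-1), dim=-1)
    return w[:, :1] * prefix_out + w[:, 1:] * suffix_out
```

The recombination step works because attention over the concatenated prefix-plus-suffix keys equals a convex combination of the two partial outputs, with weights proportional to each part's softmax denominator; this is what allows the prefix computation to be batched across sequences without approximation.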