Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models

Published: 21 Jun 2024, Last Modified: 24 Jul 2024, ES-FoMo-II 2024 Poster, License: CC BY 4.0
Keywords: LLM inference efficiency, prefilling, bin packing
TL;DR: We propose “Prepacking”, an approach for optimizing LLM inference by reducing prefilling overhead with a bin-packing algorithm, significantly boosting speed and memory efficiency for variable-length prompts.
Abstract: During inference for transformer-based LLMs, prefilling computes the key-value (KV) cache for prompt input tokens before autoregressive generation. This work highlights a pitfall of prefilling: for batches containing prompts of widely varying lengths, significant computation is wasted by the standard practice of padding sequences to the maximum length. As LLMs support longer context lengths, variations in prompt lengths within a batch become more pronounced. To address this, we propose Prepacking, a simple yet effective method to optimize prefilling computation. Prepacking combines prompts of varying lengths into a single sequence and packs multiple sequences into a compact batch using a bin-packing algorithm, then modifies the attention mask and positional encoding to compute multiple prefilled KV-caches within a single sequence. On standard datasets with varying prompt lengths, our method significantly improves speed and memory efficiency compared to default padding-based prefilling in Huggingface across various model configurations and inference scenarios.
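To make the packing step concrete, below is a minimal sketch of the idea described in the abstract, assuming a simple first-fit-decreasing bin-packing heuristic, per-prompt causal (block-diagonal) attention masking, and position IDs that restart at each prompt boundary. Names such as `pack_prompts`, `build_mask_and_positions`, and `max_len` are illustrative and not taken from the authors' implementation.

```python
from typing import List, Tuple


def pack_prompts(prompt_lens: List[int], max_len: int) -> List[List[int]]:
    """Greedily pack prompt indices into bins whose total length <= max_len
    (first-fit-decreasing heuristic)."""
    order = sorted(range(len(prompt_lens)), key=lambda i: -prompt_lens[i])
    bins: List[List[int]] = []
    loads: List[int] = []
    for i in order:
        for b, load in enumerate(loads):
            if load + prompt_lens[i] <= max_len:
                bins[b].append(i)
                loads[b] += prompt_lens[i]
                break
        else:
            # No existing bin fits this prompt; open a new packed sequence.
            bins.append([i])
            loads.append(prompt_lens[i])
    return bins


def build_mask_and_positions(lens_in_bin: List[int]) -> Tuple[List[List[int]], List[int]]:
    """Block-diagonal causal mask and restarting position ids for one packed sequence,
    so tokens attend only within their own prompt."""
    total = sum(lens_in_bin)
    mask = [[0] * total for _ in range(total)]
    positions: List[int] = []
    offset = 0
    for n in lens_in_bin:
        for q in range(n):
            positions.append(q)  # positions restart for each packed prompt
            for k in range(q + 1):
                mask[offset + q][offset + k] = 1  # causal attention within the prompt only
        offset += n
    return mask, positions


if __name__ == "__main__":
    prompt_lens = [5, 2, 9, 3, 1]  # variable-length prompts in one batch
    bins = pack_prompts(prompt_lens, max_len=10)
    print("packed bins (prompt indices):", bins)
    lens0 = [prompt_lens[i] for i in bins[0]]
    mask, pos = build_mask_and_positions(lens0)
    print("position ids for first packed sequence:", pos)
```

The packed sequences, together with these masks and position IDs, can then be fed to a single prefilling forward pass, avoiding the padding that a max-length batch would otherwise require.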
Submission Number: 70