Prepacking: A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models

Published: 21 Jun 2024, Last Modified: 24 Jul 2024, ES-FoMo-II 2024 Poster, License: CC BY 4.0
Keywords: LLM inference efficiency, prefilling, bin packing
TL;DR: We propose “Prepacking”, an approach for optimizing LLM inference by reducing prefilling overhead with a bin-packing algorithm, significantly boosting speed and memory efficiency for variable-length prompts.
Abstract: During inference for transformer-based LLMs, prefilling computes the key-value (KV) cache for prompt input tokens before autoregressive generation. This work highlights a pitfall of prefilling: for batches containing prompts of widely varying lengths, significant computation is wasted by the standard practice of padding sequences to the maximum length. As LLMs support longer context lengths, variations in prompt lengths within a batch become more pronounced. To address this, we propose Prepacking, a simple yet effective method to optimize prefilling computation. Prepacking combines prompts of varying lengths into a single sequence and packs multiple sequences into a compact batch using a bin-packing algorithm, then modifies the attention mask and positional encoding to compute multiple prefilled KV-caches within a single sequence. On standard datasets with varying prompt lengths, our method significantly improves speed and memory efficiency compared to default padding-based prefilling in Huggingface across various model configurations and inference scenarios.
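To make the packing step concrete, below is a minimal sketch of the idea described in the abstract, assuming a simple first-fit-decreasing bin-packing heuristic, per-prompt causal (block-diagonal) attention masking, and position IDs that restart at each prompt boundary. Names such as `pack_prompts`, `build_mask_and_positions`, and `max_len` are illustrative and not taken from the authors' implementation.

```python
from typing import List, Tuple


def pack_prompts(prompt_lens: List[int], max_len: int) -> List[List[int]]:
    """Greedily pack prompt indices into bins whose total length <= max_len
    (first-fit-decreasing heuristic)."""
    order = sorted(range(len(prompt_lens)), key=lambda i: -prompt_lens[i])
    bins: List[List[int]] = []
    loads: List[int] = []
    for i in order:
        for b, load in enumerate(loads):
            if load + prompt_lens[i] <= max_len:
                bins[b].append(i)
                loads[b] += prompt_lens[i]
                break
        else:
            # No existing bin fits this prompt; open a new packed sequence.
            bins.append([i])
            loads.append(prompt_lens[i])
    return bins


def build_mask_and_positions(lens_in_bin: List[int]) -> Tuple[List[List[int]], List[int]]:
    """Block-diagonal causal mask and restarting position ids for one packed sequence,
    so tokens attend only within their own prompt."""
    total = sum(lens_in_bin)
    mask = [[0] * total for _ in range(total)]
    positions: List[int] = []
    offset = 0
    for n in lens_in_bin:
        for q in range(n):
            positions.append(q)  # positions restart for each packed prompt
            for k in range(q + 1):
                mask[offset + q][offset + k] = 1  # causal attention within the prompt only
        offset += n
    return mask, positions


if __name__ == "__main__":
    prompt_lens = [5, 2, 9, 3, 1]  # variable-length prompts in one batch
    bins = pack_prompts(prompt_lens, max_len=10)
    print("packed bins (prompt indices):", bins)
    lens0 = [prompt_lens[i] for i in bins[0]]
    mask, pos = build_mask_and_positions(lens0)
    print("position ids for first packed sequence:", pos)
```

The packed sequences, together with these masks and position IDs, can then be fed to a single prefilling forward pass, avoiding the padding that a max-length batch would otherwise require.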
Submission Number: 70