KVPR: Efficient LLM Inference with I/O-Aware KV Cache Partial Recomputation

ACL ARR 2024 December Submission 1157 Authors

15 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: Inference for Large Language Models (LLMs) is computationally demanding. To reduce the cost of auto-regressive decoding, the Key-Value (KV) cache stores intermediate activations, significantly lowering the computational overhead of token generation. However, the memory required for the KV cache grows rapidly, often exceeding the capacity of GPU memory. A cost-effective alternative is to offload the KV cache to CPU memory, which alleviates GPU memory pressure but shifts the bottleneck to the limited bandwidth of the PCIe connection between the CPU and GPU. Existing methods attempt to address this issue by overlapping GPU computation with I/O or by employing CPU-GPU heterogeneous execution, but they are hindered by excessive data movement and dependence on CPU capabilities. Moreover, fully hiding the PCIe communication latency becomes increasingly difficult as the KV cache grows and/or GPU compute capabilities increase. In this paper, we introduce KVPR, an efficient I/O-aware LLM inference method in which the CPU first transfers a partial set of activations, from which the GPU can recompute the corresponding KV cache values. While the GPU recomputes this partial KV cache, the remaining portion is concurrently transferred from the CPU. Overlapping GPU recomputation with KV cache transfer minimizes idle GPU time and maximizes inference performance. KVPR is fully automated: a profiler module collects input characteristics and system hardware information, a scheduler module optimizes the distribution of computation and communication workloads, and a runtime module efficiently executes the derived execution plan. Experimental results show that KVPR achieves up to 35.8% lower latency and 46.2% higher throughput during decoding compared to state-of-the-art approaches. The code is available at https://anonymous.4open.science/r/KVPR-C176/.
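
The recomputation/transfer overlap described in the abstract can be illustrated with a minimal PyTorch sketch using two CUDA streams. This is an assumption-laden illustration, not KVPR's actual implementation: the function name fetch_kv_with_partial_recompute, the tensors hidden_states_cpu and kv_cache_cpu (assumed pinned), the projection weights w_k and w_v, and the split point l are all hypothetical placeholders standing in for what the paper's scheduler would derive.

import torch

def fetch_kv_with_partial_recompute(hidden_states_cpu, kv_cache_cpu, w_k, w_v, l):
    """Sketch: recompute the KV cache of the first `l` tokens on the GPU
    while the KV cache of the remaining tokens streams over PCIe."""
    copy_stream = torch.cuda.Stream()
    compute_stream = torch.cuda.Stream()

    with torch.cuda.stream(copy_stream):
        # Asynchronous host-to-device copy of the tail of the cached
        # keys/values (non_blocking requires pinned CPU memory).
        k_tail = kv_cache_cpu[0][l:].to("cuda", non_blocking=True)
        v_tail = kv_cache_cpu[1][l:].to("cuda", non_blocking=True)

    with torch.cuda.stream(compute_stream):
        # Transfer only the (much smaller) activations of the first `l`
        # tokens and recompute their keys/values on the GPU.
        h_head = hidden_states_cpu[:l].to("cuda", non_blocking=True)
        k_head = h_head @ w_k
        v_head = h_head @ w_v

    torch.cuda.synchronize()  # both streams done: full KV cache is on GPU
    return torch.cat([k_head, k_tail]), torch.cat([v_head, v_tail])

In this sketch, the split point l plays the role the paper assigns to the scheduler: chosen well, the time to recompute l tokens' keys/values roughly matches the time to transfer the remaining tokens' cache, so neither stream leaves the GPU idle.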
Paper Type: Long
Research Area: Efficient/Low-Resource Methods for NLP
Research Area Keywords: NLP in resource-constrained settings
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency)
Languages Studied: English
Submission Number: 1157