TL;DR: Cake is a KV cache loading system that optimally balances computation and I/O resources through a bidirectional scheduling strategy, reducing Time to First Token (TTFT) by 2.6× on average and improving long-context LLM inference efficiency.
Abstract: Large Language Models (LLMs) are increasingly deployed in large-scale online services, enabling sophisticated applications. However, the computational overhead of generating key-value (KV) caches in the prefill stage presents a major bottleneck, particularly for long-context inputs. Prefix caching mitigates this issue by storing KV caches for reuse, reducing redundant computation. Despite its advantages, prefix caching suffers from high latency due to the limited I/O bandwidth of storage devices, constraining inference efficiency. To address this challenge, we introduce Cake, a novel KV cache loading system that optimally utilizes both computational and I/O resources in parallel. Cake employs a bidirectional scheduling strategy that dynamically balances KV cache computation and loading, ensuring efficient resource utilization. Additionally, Cake incorporates an adaptive scheduling mechanism that seamlessly integrates with non-prefix-caching requests, improving system throughput and adapting to fluctuating resource availability. Through extensive evaluations across various hardware configurations, datasets, and storage conditions, Cake achieves an average 2.6× reduction in Time to First Token (TTFT) compared to compute-only and I/O-only methods. Our findings highlight Cake as an effective and practical solution for optimizing long-context LLM inference, bridging the gap between computation and I/O efficiency in large-scale AI deployments.
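To make the bidirectional idea concrete, here is a minimal illustrative sketch, not the authors' implementation: assume the prompt's KV cache is split into fixed-size chunks, a GPU worker recomputes chunks from the front of the sequence (so each chunk's attention sees all earlier KV), and an I/O worker loads saved chunks from the back, until the two frontiers meet. The functions compute_kv_chunk and load_kv_chunk are hypothetical placeholders for the compute and storage paths; the meeting point is not fixed in advance, so whichever resource is faster simply covers more chunks, which is one way to realize the dynamic balancing described above.

```python
# Illustrative sketch only: a bidirectional scheduler that fills a chunked
# KV cache from both ends in parallel. Chunk granularity and the helper
# functions passed in are assumptions, not the paper's actual API.
import threading

def bidirectional_prefill(num_chunks, compute_kv_chunk, load_kv_chunk):
    lock = threading.Lock()
    front = 0               # next chunk index for the compute worker
    back = num_chunks - 1   # next chunk index for the I/O worker
    kv = [None] * num_chunks

    def claim(from_front):
        # Atomically claim the next unprocessed chunk from one end,
        # or return None once the two frontiers have met.
        nonlocal front, back
        with lock:
            if front > back:
                return None
            if from_front:
                idx, front = front, front + 1
            else:
                idx, back = back, back - 1
            return idx

    def compute_worker():
        while (idx := claim(from_front=True)) is not None:
            kv[idx] = compute_kv_chunk(idx)   # GPU prefill for this chunk

    def load_worker():
        while (idx := claim(from_front=False)) is not None:
            kv[idx] = load_kv_chunk(idx)      # fetch saved KV from storage

    t_compute = threading.Thread(target=compute_worker)
    t_load = threading.Thread(target=load_worker)
    t_compute.start(); t_load.start()
    t_compute.join(); t_load.join()
    return kv
```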
Lay Summary: Large Language Models (LLMs) have become vital tools in various online applications, helping users with tasks like answering questions, summarizing documents, and chatting naturally. However, using these models with lengthy text inputs (like an entire book or a long conversation) can be slow, as they need to perform extensive calculations before providing their first response.
Previous approaches typically relied solely on loading pre-calculated data from storage, but loading alone often fails to sufficiently reduce the latency users experience, especially when storage bandwidth is limited. In this work, we introduce a new approach called "Cake," which combines on-the-fly recomputation with loading of pre-stored data, performing both simultaneously. Cake manages resources efficiently by recomputing some parts of the data while loading other parts from storage, significantly reducing waiting time even under constrained storage bandwidth.
By dynamically adapting to available computing power and storage bandwidth, Cake effectively handles real-world conditions, enhancing the responsiveness of LLMs. Our extensive experiments demonstrate that Cake consistently accelerates the initial response time, making large language models faster and more practical for everyday applications.
Primary Area: General Machine Learning->Hardware and Software
Keywords: Prefix Caching, System Optimization, KV Cache, Low-Latency LLM Inference
Submission Number: 1121