Keywords: Efficient LLM Serving, LLM Serving System, LLM Prefix Caching
TL;DR: Cake optimizes long-context LLM processing with prefix caching by efficiently balancing compute and I/O, simultaneously loading the KV cache from disk and performing prefill operations to reduce response times.
Abstract: Recent advancements in Large Language Models (LLMs) have significantly increased context window sizes, enabling sophisticated applications but also introducing substantial computational overhead, particularly in computing the key-value (KV) cache during the prefill stage. Prefix caching has emerged to save GPU work in this scenario by storing the KV cache on disk and reusing it across multiple queries. However, traditional prefix caching mechanisms often suffer from substantial latency because the speed of loading the KV cache from disk to GPU memory is bottlenecked by the throughput of I/O devices. To optimize the latency of long-context prefill, we propose Cake, a novel KV cache loader, which employs a bidirectional parallelized KV cache generation strategy. Upon receiving a prefill task, Cake simultaneously and dynamically loads the saved KV cache from prefix cache locations and computes the KV cache on local GPUs, maximizing the utilization of available computation and I/O bandwidth resources. Additionally, Cake automatically adapts to diverse system conditions without manual parameter tuning. In experiments on various prompt datasets, GPUs, and I/O devices, Cake offers up to a 68.1% reduction in Time To First Token (TTFT) compared with a compute-only method and up to a 94.6% TTFT reduction compared with an I/O-only method.
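To illustrate the scheduling idea behind the bidirectional strategy described above, here is a minimal sketch (not the authors' implementation): an I/O worker consumes KV chunks from the front of the prompt while a compute worker prefills chunks from the back, and the two stop when they meet. The names `load_chunk` and `compute_chunk` are hypothetical placeholders for disk I/O and GPU prefill, and attention dependencies between chunks are omitted for brevity.

```python
import threading

def bidirectional_prefill(num_chunks, load_chunk, compute_chunk):
    """Sketch of a two-worker scheme: I/O loads chunks from the front,
    compute prefills chunks from the back, until the two indices meet."""
    lock = threading.Lock()
    front, back = 0, num_chunks   # next chunk to load / one past next chunk to compute
    kv = [None] * num_chunks

    def loader():
        nonlocal front
        while True:
            with lock:
                if front >= back:          # met the compute worker; all chunks claimed
                    return
                i, front = front, front + 1
            kv[i] = load_chunk(i)          # fetch saved KV chunk from the prefix cache (disk)

    def computer():
        nonlocal back
        while True:
            with lock:
                if back <= front:          # met the I/O worker; all chunks claimed
                    return
                back -= 1
                i = back
            kv[i] = compute_chunk(i)       # prefill this chunk on the local GPU

    t = threading.Thread(target=loader)
    t.start()
    computer()                              # run compute on the calling thread
    t.join()
    return kv
```

The point where the two workers meet is determined dynamically by whichever resource (I/O bandwidth or GPU compute) is faster at the moment, which is why no manual split-point tuning is needed in this scheme.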
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 586