Keywords: Efficient LLM Serving, LLM Serving System, LLM Prefix Caching
TL;DR: Cake optimizes long-context LLM processing with prefix caching by efficiently balancing compute and I/O, simultaneously loading the KV cache from disk and performing prefill operations to reduce response times.
Abstract: Recent advancements in Large Language Models (LLMs) have significantly increased context window sizes, enabling sophisticated applications but also introducing substantial computational overhead, particularly in computing the key-value (KV) cache during the prefill stage. Prefix caching has emerged to save GPU work in this scenario by storing the KV cache on disk and reusing it across multiple queries. However, traditional prefix caching mechanisms often suffer from substantial latency because the speed of loading the KV cache from disk to GPU memory is bottlenecked by the throughput of I/O devices. To optimize the latency of long-context prefill, we propose Cake, a novel KV cache loader, which employs a bidirectional parallelized KV cache generation strategy. Upon receiving a prefill task, Cake simultaneously and dynamically loads the saved KV cache from prefix cache locations and computes the KV cache on local GPUs, maximizing the utilization of available computation and I/O bandwidth resources. Additionally, Cake automatically adapts to diverse system conditions without manual parameter tuning. In experiments on various prompt datasets, GPUs, and I/O devices, Cake offers up to a 68.1% reduction in Time To First Token (TTFT) compared with a compute-only method and up to a 94.6% TTFT reduction compared with an I/O-only method.
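To illustrate the scheduling idea behind the bidirectional strategy described above, here is a minimal sketch (not the authors' implementation): an I/O worker consumes KV chunks from the front of the prompt while a compute worker prefills chunks from the back, and the two stop when they meet. The names `load_chunk` and `compute_chunk` are hypothetical placeholders for disk I/O and GPU prefill, and attention dependencies between chunks are omitted for brevity.

```python
import threading

def bidirectional_prefill(num_chunks, load_chunk, compute_chunk):
    """Sketch of a two-worker scheme: I/O loads chunks from the front,
    compute prefills chunks from the back, until the two indices meet."""
    lock = threading.Lock()
    front, back = 0, num_chunks   # next chunk to load / one past next chunk to compute
    kv = [None] * num_chunks

    def loader():
        nonlocal front
        while True:
            with lock:
                if front >= back:          # met the compute worker; all chunks claimed
                    return
                i, front = front, front + 1
            kv[i] = load_chunk(i)          # fetch saved KV chunk from the prefix cache (disk)

    def computer():
        nonlocal back
        while True:
            with lock:
                if back <= front:          # met the I/O worker; all chunks claimed
                    return
                back -= 1
                i = back
            kv[i] = compute_chunk(i)       # prefill this chunk on the local GPU

    t = threading.Thread(target=loader)
    t.start()
    computer()                              # run compute on the calling thread
    t.join()
    return kv
```

The point where the two workers meet is determined dynamically by whichever resource (I/O bandwidth or GPU compute) is faster at the moment, which is why no manual split-point tuning is needed in this scheme.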
Primary Area: infrastructure, software libraries, hardware, systems, etc.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 586