Abstract: Transformer-based large language models (LLMs) have achieved remarkable results on long-text tasks, but limited GPU memory (VRAM) struggles to accommodate the key-value (KV) cache, whose size grows linearly with sequence length; this has become a bottleneck for applying LLMs to long sequences. Existing methods compress the KV cache by evicting, merging, or quantizing its entries. However, such compression causes irreversible information forgetting, which can degrade the accuracy of subsequent decoding. In this paper, we propose SpeCache, which takes full advantage of large and easily expandable CPU memory to offload the complete KV cache, and dynamically fetches KV pairs back at each decoding step according to their importance, measured with a low-precision copy of the KV cache kept in VRAM. To avoid the inference latency caused by CPU-GPU communication, SpeCache speculatively predicts the KV pairs that the next token is likely to attend to, so that they can be prefetched before the next decoding step and prefetching can be parallelized with computation. Experiments on the LongBench and Needle-in-a-Haystack benchmarks verify that SpeCache effectively reduces VRAM usage while avoiding information forgetting on long sequences, without re-training, even at a 10x KV cache compression ratio.
Lay Summary: The KV cache is one of the primary contributors to memory usage during inference in large language models (LLMs). We propose a training-free approach to reduce its memory overhead. Specifically, we offload the original 16-bit KV cache to CPU RAM, and during inference we fetch into GPU memory only the top-k KV entries that are most relevant to the current decoding step.
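As a minimal sketch of this idea (not the authors' implementation), the snippet below scores KV entries with a low-precision key copy kept in VRAM and fetches only the top-k full-precision pairs back from CPU memory. All tensor shapes, the single-head layout, and the function name `fetch_topk_kv` are illustrative assumptions; the low-bit copy is represented simply as an already-dequantized float tensor.

```python
import torch

def fetch_topk_kv(query, k_lowbit, cpu_k, cpu_v, top_k):
    # query:       (1, d)   current token's query, on the GPU
    # k_lowbit:    (seq, d) dequantized low-bit key copy kept in VRAM
    # cpu_k/cpu_v: (seq, d) full 16-bit KV cache offloaded to CPU (ideally pinned)
    # Approximate the attention scores with the low-precision keys.
    scores = query @ k_lowbit.T                                   # (1, seq)
    idx = scores.topk(top_k, dim=-1).indices.squeeze(0).cpu()     # (top_k,)
    # Fetch only the selected full-precision KV pairs back to the GPU.
    k_sel = cpu_k[idx].to(query.device, non_blocking=True)
    v_sel = cpu_v[idx].to(query.device, non_blocking=True)
    return k_sel, v_sel
```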
To enable parallelism between data loading and computation, we introduce a speculative token that approximates the next output token and is decoded concurrently with the current token. Using this speculative token and a 1-bit or 2-bit precision replica of the KV cache, we predict the top-k KV entries likely to be required in the next step. This allows us to preload the necessary KV cache entries before they are actually needed, significantly reducing GPU memory usage without introducing noticeable latency overhead.
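The following sketch illustrates, under stated assumptions, how such a prefetch could be overlapped with decoding using a side CUDA stream: while the current token is decoded on the default stream, the KV pairs predicted for the next step are copied host-to-device. `predict_next_topk`, `decode_step`, and the `cache` fields are hypothetical placeholders, not the paper's API.

```python
import torch

copy_stream = torch.cuda.Stream()  # side stream for host-to-device copies

def decode_with_prefetch(model, token, spec_token, cache):
    # Predict which KV pairs the *next* token will need, using the
    # speculative token and the low-bit KV copy already resident in VRAM.
    # (predict_next_topk is an assumed helper, e.g. a top-k over approximate scores.)
    next_idx = predict_next_topk(spec_token, cache.k_lowbit, top_k=cache.top_k)

    with torch.cuda.stream(copy_stream):
        # Asynchronous copy of the predicted KV pairs; pinned CPU memory is
        # required for the copy to actually overlap with computation.
        k_next = cache.cpu_k[next_idx].to("cuda", non_blocking=True)
        v_next = cache.cpu_v[next_idx].to("cuda", non_blocking=True)

    # Decode the current token on the default stream in parallel with the copy.
    logits = decode_step(model, token, cache)

    # Wait for the prefetch before the next step uses the new KV pairs.
    torch.cuda.current_stream().wait_stream(copy_stream)
    cache.k_gpu, cache.v_gpu = k_next, v_next
    return logits
```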
Primary Area: Deep Learning->Large Language Models
Keywords: Key-Value Cache, Large Language Models
Submission Number: 625