Keywords: transformer, kv cache, compression
TL;DR: We present KVTC, a lightweight transform coder that enables extended retention of transformer KV caches via compression.
Abstract: Serving large language models (LLMs) at scale necessitates efficient key-value (KV) cache management. KV caches can be reused across conversation turns via shared-prefix prompts that are common in iterative code editing and chat. However, stale caches consume scarce GPU memory, require offloading, or force recomputation. We present KVTC, a lightweight transform coder that compresses KV caches for compact on-GPU and off-GPU storage. Drawing on classical media compression, KVTC combines PCA-based feature decorrelation, adaptive quantization, and entropy coding. It requires only a brief initial calibration and leaves model parameters unchanged. By exploiting redundancies in KV caches, KVTC achieves up to 20× compression while maintaining reasoning and long-context accuracy in Llama 3.1, Mistral-NeMo, and R1-Qwen 2.5. Across benchmarks including AIME25, LiveCodeBench, GSM8K, MMLU, Qasper, RULER, and MATH500, KVTC consistently outperforms inference-time baselines such as token eviction, quantization, and SVD-based methods, delivering substantially higher compression ratios. These results support KVTC as a practical building block for memory-efficient LLM serving with reusable KV caches.
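The abstract describes a transform-coding pipeline of PCA-based decorrelation, quantization, and entropy coding. The sketch below illustrates that general recipe on a KV block; it is not the authors' implementation. The helper names (`fit_pca`, `compress`, `decompress`), the fixed quantization step, and the use of `zlib` as a stand-in for a proper entropy coder are all illustrative assumptions.

```python
# Illustrative sketch of a KVTC-style pipeline (not the paper's code):
# PCA basis fit on calibration KV vectors, uniform quantization, and a
# generic byte-level entropy coder (zlib) as a stand-in for a real one.
import numpy as np
import zlib

def fit_pca(calib_kv: np.ndarray):
    """Fit a PCA basis on calibration KV vectors of shape (n_tokens, head_dim)."""
    mean = calib_kv.mean(axis=0)
    centered = calib_kv - mean
    # Principal directions come from the SVD of the centered calibration matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt  # rows of vt are the principal components

def compress(kv: np.ndarray, mean: np.ndarray, vt: np.ndarray,
             step: float = 0.05) -> bytes:
    """Decorrelate, uniformly quantize, and entropy-code a KV block."""
    coeffs = (kv - mean) @ vt.T                    # project onto PCA basis
    q = np.round(coeffs / step).astype(np.int16)   # fixed step; KVTC adapts this
    return zlib.compress(q.tobytes())              # entropy-coding stand-in

def decompress(blob: bytes, mean: np.ndarray, vt: np.ndarray,
               n_tokens: int, step: float = 0.05) -> np.ndarray:
    """Invert entropy coding, dequantize, and map back to feature space."""
    q = np.frombuffer(zlib.decompress(blob), dtype=np.int16).reshape(n_tokens, -1)
    coeffs = q.astype(np.float32) * step
    return coeffs @ vt + mean                      # inverse PCA transform

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    calib = rng.standard_normal((4096, 128)).astype(np.float32)  # calibration KVs
    mean, vt = fit_pca(calib)
    kv_block = rng.standard_normal((1024, 128)).astype(np.float32)
    blob = compress(kv_block, mean, vt)
    recon = decompress(blob, mean, vt, n_tokens=1024)
    print(f"ratio ~{kv_block.nbytes / len(blob):.1f}x, "
          f"max error {np.abs(recon - kv_block).max():.3f}")
```

On real KV activations, which are strongly correlated across feature dimensions, the PCA step concentrates energy in few components and the quantized coefficients become highly compressible; the random data used in this toy example will not show that gain.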
Primary Area: foundation or frontier models, including LLMs
Submission Number: 19321