Joint Encoding of KV-Cache Blocks for Scalable LLM Serving

ICLR 2026 Conference Submission 18385 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM serving, joint representation, KV-cache compression
Abstract: Modern large language models (LLMs) drive interactive AI systems but are bottlenecked by the memory-heavy growth of key–value (KV) caches, which limits real-time throughput under concurrent loads. Existing KV-cache compression methods rely on rigid heuristics, disrupt tensor layouts, or require specialized compute, hindering scalability and deployment. We propose joint encoding of KV-cache blocks, which fuses similar blocks across requests and input chunks into shared representations while preserving standard cache structure. This alleviates the KV-cache memory bottleneck, supporting high-concurrency serving without specialized hardware. Theoretically, we analyze the rate–distortion tradeoff of fused cache blocks under a Poisson process model. Empirically, our method achieves up to 4.38× KV-cache compression with negligible accuracy loss across diverse LLMs and benchmarks, outperforming recent structured and adaptive compression baselines. Our results establish a scalable, plug-and-play pathway for memory-efficient, high-throughput autoregressive inference. Code is available at \href{https://anonymous.4open.science/r/kv_joint_encoding-55B0/}{\nolinkurl{kv_joint_encoding-55B0}}.
Primary Area: optimization
Submission Number: 18385