Joint Encoding of KV-Cache Blocks for Scalable LLM Serving

05 Feb 2026 (modified: 29 Apr 2026) · Rejected by TMLR · CC BY 4.0
Abstract: Modern large language models (LLMs) drive interactive AI systems but are bottlenecked by the memory-intensive growth of key-value (KV) caches, which limits real-time throughput under concurrent loads. Existing KV-cache compression methods rely on rigid heuristics, disrupt tensor layouts, or require specialized compute, hindering scalability and deployment. We propose joint encoding of KV-cache blocks, which fuses similar blocks across requests and input chunks into shared representations while preserving the standard cache structure. This alleviates the KV-cache memory bottleneck and supports high-concurrency serving without specialized hardware. Theoretically, we analyze the rate-distortion tradeoff of fused cache blocks under a Poisson process model. Empirically, our method achieves up to 4.38$\times$ KV-cache compression with negligible accuracy loss across diverse LLMs and benchmarks, outperforming recent structured and adaptive compression baselines. In real LLM serving, joint encoding improves token throughput by $\sim$40\% on a single-machine vLLM benchmark. Code is available at \href{https://anonymous.4open.science/r/kv_joint_encoding-55B0/}{\nolinkurl{kv_joint_encoding-55B0}}.
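To make the abstract's core idea concrete, the sketch below illustrates one way similar KV-cache blocks could be fused into shared representations while keeping the standard block shape. It is a minimal illustration under assumptions, not the paper's actual algorithm: the function name `fuse_kv_blocks`, the cosine-similarity threshold, and the mean-fusion rule are all hypothetical choices for exposition.

```python
# Minimal sketch of the "joint encoding" idea from the abstract: fuse
# near-identical KV-cache blocks into shared representations while keeping
# the standard block layout. All names, thresholds, and the mean-fusion
# rule are illustrative assumptions, not the paper's method.
import torch


def fuse_kv_blocks(blocks: torch.Tensor, sim_threshold: float = 0.98):
    """Greedily group similar KV blocks and share one representation per group.

    blocks: (num_blocks, block_tokens, num_heads, head_dim) KV-cache blocks.
    Returns (shared_blocks, index_map), where index_map[i] points each
    original block to its shared block, so reads keep the usual block shape.
    """
    flat = blocks.flatten(start_dim=1)
    flat = torch.nn.functional.normalize(flat, dim=1)  # prep for cosine similarity

    centroids, members, index_map = [], [], []
    for i in range(blocks.shape[0]):
        assigned = False
        for g, centroid in enumerate(centroids):
            if torch.dot(flat[i], centroid) >= sim_threshold:
                members[g].append(i)
                index_map.append(g)
                assigned = True
                break
        if not assigned:
            centroids.append(flat[i].clone())
            members.append([i])
            index_map.append(len(centroids) - 1)

    # One shared block per group (mean of its members), same shape as originals.
    shared = torch.stack([blocks[torch.tensor(idx)].mean(dim=0) for idx in members])
    return shared, torch.tensor(index_map)


if __name__ == "__main__":
    torch.manual_seed(0)
    base = torch.randn(4, 16, 8, 64)
    # Duplicate blocks with tiny noise to mimic similar blocks across requests.
    blocks = torch.cat([base, base + 1e-3 * torch.randn_like(base)], dim=0)
    shared, index_map = fuse_kv_blocks(blocks)
    print(f"{blocks.shape[0]} blocks fused into {shared.shape[0]} shared blocks")
```

Because each original block index still resolves to a block of the standard shape via `index_map`, such a scheme would leave the cache's tensor layout and lookup path unchanged, which is the property the abstract emphasizes.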
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Tal_Schuster1
Submission Number: 7350