xKV: Cross-Layer KV-Cache Compression via Aligned Singular Vector Extraction

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: KV-Cache Compression, Large Language Model
TL;DR: We propose a novel KV-Cache compression method based on cross-layer SVD
Abstract: Large Language Models (LLMs) with long context windows enable powerful applications but come at the cost of high memory consumption for storing the key and value states (KV-Cache). Recent studies have attempted to merge the KV-Caches of multiple layers into shared representations, yet these approaches either require expensive pretraining or rely on per-token cosine similarity across layers, which is not always observed in practice. We find that the dominant singular vectors are remarkably well-aligned across multiple layers of the KV-Cache. Exploiting this insight, we propose xKV, a post-training compression method that applies Singular Value Decomposition (SVD) to the KV-Cache of grouped layers. xKV consolidates the KV-Cache of multiple layers into a shared low-rank subspace, significantly reducing KV-Cache sizes. Through extensive evaluations on the RULER long-context benchmark with widely used LLMs (e.g., Llama-3.1 and Qwen2.5), xKV achieves up to an 8× KV-Cache compression rate while keeping accuracy within 2–3 percentage points of the non-compressed baseline on a set of representative long-context tasks, and it remains robust in multi-turn settings. Coupled with the proposed Selective Reconstruction (SR) at decode time, xK-SR (keys only, values offloaded to CPU memory) yields 2.53% higher accuracy than the state-of-the-art system that combines token selection with single-layer SVD, and delivers up to 3.23× end-to-end speedup over full attention on an A100 GPU. At a similar accuracy level, xKV-SR (keys and values on GPU) achieves up to 4.23× speedup. These results highlight xKV as a versatile, plug-and-play solution that both reduces the memory footprint and accelerates inference for long-context LLMs.
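For intuition, below is a minimal sketch of the cross-layer SVD idea as described in the abstract. It is not from the paper's codebase; the function names, the token-axis stacking, and the `rank` parameter are illustrative assumptions: caches from a group of layers are factored jointly so all layers share one low-rank basis, and each layer's cache is reconstructed from the shared factors on demand.

```python
import torch

def xkv_style_compress(kv_group: list[torch.Tensor], rank: int):
    """Hypothetical sketch: jointly compress the KV-Cache of a layer group.

    kv_group: per-layer caches, each of shape (seq_len, hidden_dim).
    Stacking the group along the token axis lets a single SVD find a
    low-rank subspace (right singular vectors) shared by all layers.
    """
    stacked = torch.cat(kv_group, dim=0)             # (n_layers*seq_len, hidden_dim)
    U, S, Vh = torch.linalg.svd(stacked, full_matrices=False)
    coeffs = U[:, :rank] * S[:rank]                  # per-token coefficients
    basis = Vh[:rank, :]                             # shared low-rank basis
    # Storing (coeffs, basis) instead of the full caches compresses by
    # roughly hidden_dim / rank when rank << hidden_dim.
    return coeffs, basis

def reconstruct_layer(coeffs, basis, layer_idx: int, seq_len: int):
    """Recover one layer's cache from the shared factors."""
    rows = coeffs[layer_idx * seq_len : (layer_idx + 1) * seq_len]
    return rows @ basis
```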
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 9770