Tucker-KV: Provable Tucker Compression of KV Caches with Monotone Refinement and Near-Optimal Budgeting

ICLR 2026 Conference Submission 14398 Authors

18 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: KV Cache, Tucker Decomposition, Learning to Compress
Abstract: Key-Value (KV) caches enable fast Transformer decoding, but their memory and compute costs scale linearly with context length. Prior work on KV compression is largely built on matrix low-rank heuristics, leaving multilinear guarantees underexplored. We present Tucker-KV, a Tucker-based framework with provable properties for compressing KV tensors over the (L, S, H) axes. Our analysis establishes: (i) HOSVD-style error upper bounds and monotone refinement via HOOI; (ii) grouped-head separability, enabling parallelizable compression; (iii) a (1 − 1/e) guarantee for greedy budget allocation under a mild DR-submodularity condition; and (iv) robust residual mixing with matrix baselines that never increases error when Tucker fits the residual in the least-squares sense. We further characterize the budget regime in which Tucker-2 is preferable to full Tucker. On Qwen2.5-7B with RULER at 4k context, Tucker-KV matches Full-KV quality (EM/F1 ≈ 1.00) while reducing KV memory by 83%, with unchanged perplexity and favorable prefill throughput. Importantly, Tucker-KV is orthogonal to token-selection methods (sliding window, streaming, xKV) and can be stacked with them; our focus is the representation-compression axis, with provable monotone refinement and near-optimal budget allocation.
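To make the HOSVD/HOOI claims in the abstract concrete, below is a minimal NumPy sketch: truncated HOSVD of a hypothetical KV slab over (L, S, H), followed by one HOOI sweep whose fit error is non-increasing. The tensor shape, the ranks, and all function names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding: bring axis `mode` to the front, flatten the rest."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_dot(T, M, mode):
    """Multiply tensor T by matrix M along `mode` (contracts T's mode-th axis)."""
    return np.moveaxis(np.tensordot(M, np.moveaxis(T, mode, 0), axes=1), 0, mode)

def hosvd(T, ranks):
    """Truncated HOSVD: leading singular vectors per mode, core via projection."""
    factors = [np.linalg.svd(unfold(T, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = T
    for m, U in enumerate(factors):
        core = mode_dot(core, U.T, m)
    return core, factors

def reconstruct(core, factors):
    """Rebuild the full tensor from the core and the per-mode factors."""
    T = core
    for m, U in enumerate(factors):
        T = mode_dot(T, U, m)
    return T

def hooi_sweep(T, factors):
    """One HOOI sweep: refit each factor with the others fixed. Each update
    maximizes the captured energy, so the fit error is non-increasing
    (the monotone refinement property stated in the abstract)."""
    for m in range(len(factors)):
        G = T
        for k, U in enumerate(factors):
            if k != m:
                G = mode_dot(G, U.T, k)
        U, _, _ = np.linalg.svd(unfold(G, m), full_matrices=False)
        factors[m] = U[:, :factors[m].shape[1]]
    core = T
    for m, U in enumerate(factors):
        core = mode_dot(core, U.T, m)
    return core, factors

# Hypothetical KV slab over (L, S, H) axes; shape and ranks are made up.
rng = np.random.default_rng(0)
K = rng.standard_normal((8, 256, 128))

core, factors = hosvd(K, ranks=(6, 64, 48))
err0 = np.linalg.norm(K - reconstruct(core, factors)) / np.linalg.norm(K)
core, factors = hooi_sweep(K, factors)
err1 = np.linalg.norm(K - reconstruct(core, factors)) / np.linalg.norm(K)
print(f"HOSVD rel. error {err0:.4f} -> after one HOOI sweep {err1:.4f}")
assert err1 <= err0 + 1e-12  # monotone refinement
```

Under this sketch, the stored budget is the core plus the three factors (6·64·48 + 8·6 + 256·64 + 128·48 numbers versus 8·256·128), which is the kind of per-mode rank budget the abstract's greedy allocation would choose.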
Primary Area: optimization
Submission Number: 14398