SparseCache: Extreme Sparse Coding for KV Cache Compression

08 Sept 2025 (modified: 27 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · License: CC BY 4.0
Keywords: LLM, KV Cache Compression, Dictionary Learning, RAG, K-SVD
TL;DR: SparseCache compresses Key/Value vectors with shared learned dictionaries, making "precomputed RAG" practical by reducing TTFT and improving LLM inference throughput.
Abstract: The growing memory footprint of the Key-Value (KV) cache is a critical bottleneck in Large Language Models (LLMs), significantly hindering inference efficiency. While emerging "precomputed RAG" paradigms promise to reduce latency by precomputing KV caches for entire corpora, their prohibitive storage requirements render them impractical. This paper introduces SparseCache, a novel KV cache compression framework that addresses this bottleneck. SparseCache employs an end-to-end learning framework, inspired by the alternating optimization of the K-SVD algorithm, to learn separate, globally shared dictionaries for Key and Value vectors across the model. By optimizing these dictionaries directly against a reconstruction loss, SparseCache captures fundamental KV cache redundancies more holistically than prior per-layer methods. Extensive experiments show that SparseCache achieves a state-of-the-art compression ratio of up to 17.7x while preserving model accuracy on challenging long-context benchmarks. Notably, it maintains high performance at over 8x compression, a level at which competing techniques degrade sharply. By enabling high-fidelity compression, SparseCache makes the "precomputed RAG" paradigm practical, reducing Time-To-First-Token (TTFT) and improving overall system throughput.
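The abstract references the alternating optimization of K-SVD: fix the dictionary and sparse-code each vector, then update dictionary atoms one at a time via a rank-1 SVD of the residual. The sketch below illustrates that classic scheme on a toy stand-in for pooled Key (or Value) vectors; it is a minimal NumPy illustration of textbook K-SVD with greedy orthogonal matching pursuit, not the paper's end-to-end learned method, and all names and sizes (`omp`, `ksvd_step`, the dimensions) are illustrative assumptions.

```python
import numpy as np

def omp(D, x, k):
    """Greedy orthogonal matching pursuit: sparse-code x with at most k atoms of D."""
    residual = x.copy()
    idx = []
    for _ in range(k):
        # pick the atom most correlated with the current residual
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j not in idx:
            idx.append(j)
        # least-squares fit on the selected atoms, then refresh the residual
        coef, *_ = np.linalg.lstsq(D[:, idx], x, rcond=None)
        residual = x - D[:, idx] @ coef
    code = np.zeros(D.shape[1])
    code[idx] = coef
    return code

def ksvd_step(D, X, k):
    """One K-SVD alternation: sparse-code all columns, then update each atom in turn."""
    codes = np.stack([omp(D, X[:, i], k) for i in range(X.shape[1])], axis=1)
    for j in range(D.shape[1]):
        users = np.nonzero(codes[j])[0]  # signals whose code uses atom j
        if users.size == 0:
            continue
        # reconstruction error with atom j's contribution removed
        E = X[:, users] - D @ codes[:, users] + np.outer(D[:, j], codes[j, users])
        # optimal rank-1 refit of atom j and its coefficients
        U, s, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, j] = U[:, 0]
        codes[j, users] = s[0] * Vt[0]
    return D, codes

rng = np.random.default_rng(0)
d, n_atoms, n_vecs, sparsity = 16, 32, 200, 4
# toy stand-in for Key (or Value) vectors pooled across layers (illustrative only)
X = rng.standard_normal((d, n_vecs))
D = rng.standard_normal((d, n_atoms))
D /= np.linalg.norm(D, axis=0)  # unit-norm atoms

codes0 = np.stack([omp(D, X[:, i], sparsity) for i in range(n_vecs)], axis=1)
err_before = np.linalg.norm(X - D @ codes0)
for _ in range(5):
    D, codes = ksvd_step(D, X, sparsity)
err_after = np.linalg.norm(X - D @ codes)
print(err_before, err_after)
```

The compression ratio in this framing comes from storing, per cached vector, only the `sparsity` indices and coefficients instead of all `d` floats; SparseCache's contribution is learning such dictionaries end-to-end and sharing them globally rather than per layer.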
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 3002