Keywords: Quantum Machine Learning, Quantum Circuit, Large Language Model, KV Compression, Optimization
TL;DR: QubitCache achieves 10× KV-cache compression by preserving attention relationships through quantum-inspired amplitude encoding instead of binary token selection.
Abstract: Large language model inference suffers from quadratic KV cache memory growth that fundamentally limits long-context applications. Existing compression methods achieve memory reduction through token eviction but irreversibly discard relational information essential for complex reasoning. We present QubitCache, the first framework to recognize that attention patterns between tokens, not the tokens themselves, constitute the primary information carrier in transformers. This insight motivates a paradigm shift from discrete token selection to continuous relational preservation through quantum-inspired encoding. QubitCache introduces a hybrid architecture in which critical tokens remain in classical storage while attention patterns undergo amplitude encoding into quantum states, achieving logarithmic compression beyond classical information-theoretic limits. Unlike binary keep-or-evict decisions, our framework generates probabilistic attention distributions through quantum state measurements, maintaining contextual coherence via soft attention constraints. We prove that QubitCache preserves rank-$r$ attention structure with bounded reconstruction error, ensuring graceful degradation rather than catastrophic failure. Empirical evaluation demonstrates $7\times$ memory reduction while maintaining 92-97\% of baseline performance across five models and six benchmarks. Remarkably, QubitCache achieves this with only 15\% token retention, compared to 50\% in existing SOTA methods, yet attains 15-25\% higher F1 scores on multi-hop reasoning tasks.
Primary Area: foundation or frontier models, including LLMs
Submission Number: 10775
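Illustration (not part of the submission): the abstract's notion of amplitude encoding can be sketched classically. The snippet below is a minimal sketch, assuming a single non-negative attention row is stored as the squared amplitudes of a state over $\lceil \log_2 n \rceil$ qubits, so that simulated measurement probabilities recover the row. The function names `amplitude_encode` and `measurement_distribution` are hypothetical and do not come from QubitCache; a classical simulation like this still stores $2^q$ numbers, so it shows only the mapping, not the claimed memory savings.

```python
import numpy as np

def amplitude_encode(attn_row):
    """Map a non-negative attention row to a unit-norm amplitude vector.

    Hypothetical helper: pads the row to the next power of two and takes
    square roots, so that squared amplitudes equal attention weights.
    """
    p = np.asarray(attn_row, dtype=float)
    p = p / p.sum()                          # ensure the row sums to 1
    n_qubits = int(np.ceil(np.log2(len(p))))  # qubits needed to index len(p) entries
    padded = np.zeros(2 ** n_qubits)
    padded[: len(p)] = p
    amplitudes = np.sqrt(padded)             # |amplitude|^2 = attention weight
    return amplitudes, n_qubits

def measurement_distribution(amplitudes):
    """Born rule: outcome probabilities are the squared amplitude magnitudes."""
    return np.abs(amplitudes) ** 2

# Toy attention row over 6 tokens (assumed values, for illustration only).
attn = np.array([0.5, 0.25, 0.125, 0.125, 0.0, 0.0])
amps, n_qubits = amplitude_encode(attn)
probs = measurement_distribution(amps)
print(n_qubits, np.round(probs, 3))
```

Here a row over 6 tokens is indexed by 3 qubits; the number of qubits grows logarithmically with the number of tokens, which is the sense in which amplitude encoding is usually described as logarithmic.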