Keywords: Large Reasoning Models, Key-Value Caching, Efficient Inference, Memory Optimization, KV Cache Compression, Structural Sparsity, Expander Graphs, Long Context Reasoning
TL;DR: As LRMs are deployed in resource-constrained settings, the KV cache becomes a bottleneck for long-context reasoning. We introduce structured sparsity via expander graphs into hybrid KV cache compression, preserving information flow and contextual fidelity.
Abstract: Large Reasoning Models (LRMs) use Key-Value (KV) caching to speed up autoregressive decoding by reusing previously computed attention states over long contexts. However, the KV cache grows linearly with sequence length, quickly saturating GPU memory and becoming a bottleneck for long-context reasoning. Prior work, such as GEAR (GEnerative Inference with Approximation Error Reduction), compresses KV caches by combining low-bit quantization, sparse outlier handling, and low-rank approximation. We propose GEAR-X, a drop-in modification that replaces unstructured magnitude-based outlier selection with structured sparsity via expander graphs. This design provides spectral guarantees that preserve connectivity and information flow under aggressive compression, improving the fidelity of the compressed cache without retraining. Our preliminary experiments on the GSM8K, AQuA, BBH, and LongBench benchmarks show that GEAR-X achieves competitive or improved accuracy compared to standard GEAR, while maintaining significant memory savings.
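To illustrate the compression scheme described in the abstract, here is a minimal NumPy sketch of the three-part decomposition (structured-sparse outliers + low-bit quantized bulk + low-rank error correction). All names (`expander_outlier_mask`, `quantize_uniform`, `compress_kv_sketch`), the random-regular-graph construction used to stand in for an expander, and the toy quantizer are illustrative assumptions, not the authors' implementation; GEAR's actual quantization, outlier handling, and low-rank steps differ in detail.

```python
# Hypothetical sketch of expander-structured outlier selection for KV cache
# compression, in the spirit of GEAR-style quantization + sparse outliers +
# low-rank residual. Names and constructions are illustrative assumptions.
import numpy as np

def expander_outlier_mask(n_tokens, d_head, degree, seed=0):
    """Structured sparsity mask from a random d-regular bipartite pattern.

    Random regular bipartite graphs are expanders with high probability:
    every token row keeps exactly `degree` full-precision entries and the
    channels are covered roughly evenly, so the retained entries stay
    well connected across the cache.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((n_tokens, d_head), dtype=bool)
    for t in range(n_tokens):
        cols = rng.choice(d_head, size=degree, replace=False)  # `degree` edges per token
        mask[t, cols] = True
    return mask

def quantize_uniform(x, n_bits=4):
    """Toy per-tensor uniform quantizer (stand-in for the low-bit step)."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / (2 ** n_bits - 1) or 1.0  # avoid divide-by-zero on constant input
    return np.round((x - lo) / scale) * scale + lo

def compress_kv_sketch(kv, degree=4, n_bits=4, rank=2):
    """Compress one KV matrix: structured outliers + quantized bulk + low-rank residual."""
    n_tokens, d_head = kv.shape
    mask = expander_outlier_mask(n_tokens, d_head, degree)

    outliers = np.where(mask, kv, 0.0)                      # kept at full precision
    bulk_q = np.where(mask, 0.0, quantize_uniform(np.where(mask, 0.0, kv), n_bits))

    residual = kv - (outliers + bulk_q)                     # remaining approximation error
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]         # rank-r error correction

    return outliers + bulk_q + low_rank

if __name__ == "__main__":
    kv = np.random.randn(128, 64).astype(np.float32)        # toy (tokens x head_dim) cache
    approx = compress_kv_sketch(kv)
    print("relative error:", np.linalg.norm(kv - approx) / np.linalg.norm(kv))
```

The only change relative to a magnitude-based baseline is how `mask` is chosen: instead of taking the top-k largest entries (which can concentrate on a few tokens or channels), the expander-style pattern spreads the retained entries so that no region of the cache is left entirely to the quantized approximation.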
Submission Number: 14