R-KVHash: Reasoning Model KV Cache Compression Via SimHash-based Estimation of Redundant Tokens
Keywords: reasoning, kv cache, simhash, lsh, compression
TL;DR: We propose a KV cache compression algorithm that efficiently identifies redundant tokens in reasoning traces using SimHash.
Abstract: Reasoning models excel on benchmarks that benefit from multi-step reasoning. However, their reasoning traces are often excessively verbose, producing outputs that contain tens of thousands of tokens. The resulting key-value (KV) cache, which stores past token embeddings, grows linearly with sequence length. The R-KV compression algorithm addresses this issue by evicting from the cache tokens associated with redundant self-reflection and reasoning steps. However, this per-token redundancy estimation relies on calculating pairwise key cosine similarities, necessitating a Gram matrix product of the key cache, along with an accumulated attention score calculation. The memory and computational complexity of these operations become expensive with increasing budget and context size. We propose R-KVHash, which uses locality-sensitive hashing, namely SimHash, to efficiently estimate the key similarities with sub-linear memory and computational complexity, and entirely avoids computation of attention-based importance. We find that our approach, which buckets keys through a binarized Gaussian projection, exhibits up to 2$\times$ higher decoding throughput than R-KV and is also competitive in accuracy on MATH500 and GSM8K for DeepSeek-R1-Distill-Qwen 7B and 14B.
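The bucketing step described in the abstract can be sketched as follows. This is a minimal illustrative implementation of generic SimHash bucketing (sign of random Gaussian projections), not the paper's actual R-KVHash code; the function name, bit width, and toy dimensions are assumptions for illustration.

```python
import numpy as np

def simhash_buckets(keys: np.ndarray, n_bits: int = 16, seed: int = 0) -> np.ndarray:
    """Assign each key vector a SimHash bucket ID.

    keys: (n, d) array of key embeddings.
    Projects each key onto n_bits random Gaussian hyperplanes, binarizes
    the projections by sign, and packs the bits into an integer bucket ID.
    Keys with high cosine similarity tend to land in the same bucket, so
    bucket collisions serve as a cheap proxy for pairwise similarity.
    """
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((keys.shape[1], n_bits))  # random hyperplanes
    bits = (keys @ planes) >= 0                            # binarized Gaussian projection
    return bits @ (1 << np.arange(n_bits))                 # pack bits into an int ID

# Toy usage: hash a small batch of synthetic "key" vectors.
keys = np.random.default_rng(1).standard_normal((8, 64))
bucket_ids = simhash_buckets(keys)
```

Because the hash is data-independent (the hyperplanes are fixed once), new keys can be bucketed in O(d · n_bits) per token without recomputing anything over the existing cache, which is where the sub-linear cost relative to a full Gram matrix comes from.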
Submission Number: 97