TL;DR: We introduce a novel KV cache compression method that leverages sparse coding with a universal dictionary.
Abstract: We introduce Lexico, a novel KV cache compression method that leverages sparse coding with a universal dictionary. Our key finding is that the key-value cache in modern LLMs can be accurately approximated using sparse linear combinations drawn from a small, input-agnostic dictionary of ~4k atoms, enabling efficient compression across different input prompts, tasks, and models. Using orthogonal matching pursuit for sparse approximation, Lexico achieves flexible compression ratios through direct sparsity control. On GSM8K, across multiple model families (Mistral, Llama 3, Qwen2.5), Lexico maintains 90-95% of the original performance while using only 15-25% of the full KV cache memory, outperforming both quantization and token eviction methods. Notably, Lexico remains effective in low-memory regimes where 2-bit quantization fails, achieving up to 1.7x better compression on LongBench and GSM8K while maintaining high accuracy.
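To make the sparse-approximation step concrete, here is a minimal NumPy sketch of orthogonal matching pursuit (OMP) applied to a single key or value vector. The dictionary size (4096 atoms), head dimension (128), and sparsity level are illustrative assumptions, not Lexico's exact configuration; the actual implementation (including dictionary learning and batched GPU execution) is in the linked repository.

```python
import numpy as np

def omp(D, x, sparsity):
    """Approximate x as a sparse linear combination of columns of D.

    D: (d, n_atoms) dictionary with unit-norm columns.
    x: (d,) vector to compress (one key or value vector).
    Returns (indices, coeffs) such that D[:, indices] @ coeffs ~ x.
    """
    residual = x.copy()
    indices = []
    for _ in range(sparsity):
        # Greedily pick the atom most correlated with the current residual.
        scores = np.abs(D.T @ residual)
        scores[indices] = -np.inf  # never reselect an atom
        indices.append(int(np.argmax(scores)))
        # Re-fit coefficients over all selected atoms (least squares), update residual.
        A = D[:, indices]
        coeffs, *_ = np.linalg.lstsq(A, x, rcond=None)
        residual = x - A @ coeffs
    return np.array(indices), coeffs

# Toy usage: compress a random 128-dim "key" vector with 8 atoms from a 4k-atom dictionary.
rng = np.random.default_rng(0)
D = rng.standard_normal((128, 4096))
D /= np.linalg.norm(D, axis=0)  # unit-norm atoms
x = rng.standard_normal(128)
idx, c = omp(D, x, sparsity=8)
x_hat = D[:, idx] @ c
print("relative reconstruction error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

Because the number of selected atoms is a direct knob, this is also where the abstract's "flexible compression ratios through direct sparsity control" comes from: higher sparsity gives better reconstruction at higher memory cost.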
Lay Summary: Modern AI language models like ChatGPT need to remember everything you’ve said so far in a conversation. They store this information in what’s called a "KV cache," which takes up a lot of memory—especially as conversations get longer or more complex. This creates a problem when trying to run these models efficiently on devices with limited memory, such as a single GPU.
Our method, Lexico, reduces this memory cost by compressing the stored information using a technique called sparse coding. Instead of storing all the data in full, Lexico breaks it down into a small set of building blocks (like a dictionary of reusable parts) that can represent the original information with high accuracy. These dictionaries are universal, meaning they work across different tasks without needing to be retrained.
Even under tight memory limits, Lexico maintains strong performance, retaining 90–95% of the original model's accuracy while using only 15–25% of the memory. It also outperforms popular alternatives like quantization and token deletion on challenging tasks. This makes Lexico a practical solution for running large language models more efficiently, especially when memory is a bottleneck.
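For intuition on where the memory savings come from, consider storing each cached vector as a handful of (atom index, coefficient) pairs instead of a full dense fp16 vector. The bit widths and sparsity below are illustrative assumptions for a back-of-the-envelope estimate, not the paper's reported settings.

```python
# Illustrative memory ratio for sparse KV storage vs. dense fp16 storage.
d = 128          # head dimension: dense storage costs d * 16 bits per vector
s = 8            # atoms kept per vector (the sparsity knob)
index_bits = 12  # enough to address a 4096-atom dictionary
coeff_bits = 16  # fp16 coefficients

sparse_bits = s * (index_bits + coeff_bits)  # 8 * 28 = 224 bits
dense_bits = d * 16                          # 2048 bits
print(f"sparse/dense memory: {sparse_bits / dense_bits:.1%}")  # ~10.9%
```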
Link To Code: https://github.com/krafton-ai/lexico
Primary Area: Deep Learning->Large Language Models
Keywords: transformer, kv cache, compression
Submission Number: 592