Abstract: KV Cache optimization is crucial for improving the inference efficiency and length extrapolation of Transformer-based Large Language Models (LLMs). Previous KV Cache optimization approaches often prune or compress along the sequence dimension, leading to an irreversible loss of contextual information. In this work, we propose LightCache, a novel KV Cache optimization approach that operates on the feature dimension. LightCache employs parameter-aware compression and full-context cache selection, allowing it to reduce memory usage and enhance computational efficiency without sacrificing contextual information. Importantly, LightCache integrates with LLMs in a training-free manner. Experiments demonstrate that LightCache outperforms classic extrapolation, quantization, and context-pruning methods in long-context evaluation. In terms of efficiency, LightCache reduces the KV Cache size by over 60\% and achieves 1.7$\sim$2.4$\times$ higher memory efficiency as well as a 1.5$\sim$3.6$\times$ speedup at a 32K context length.
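For intuition, the sketch below illustrates one way feature-dimension KV Cache compression could work: project each cached key/value vector onto a low-rank basis derived from the layer's own projection weights (a "parameter-aware" choice), shrinking the cache along the feature axis rather than dropping tokens. The abstract does not specify LightCache's actual mechanism, so the class, method names, and the low-rank-via-SVD construction here are illustrative assumptions, not the paper's API.

```python
# Hypothetical sketch of feature-dimension KV Cache compression.
# NOTE: not LightCache's actual algorithm; names and the SVD-based
# construction are assumptions for illustration only.
import torch


class FeatureCompressor:
    def __init__(self, W_k: torch.Tensor, W_v: torch.Tensor, rank: int):
        # Build projection bases from the key/value projection weights so the
        # compression depends on model parameters rather than data statistics.
        # W_k, W_v: (hidden_dim, head_dim)
        _, _, Vk = torch.linalg.svd(W_k, full_matrices=False)
        _, _, Vv = torch.linalg.svd(W_v, full_matrices=False)
        self.P_k = Vk[:rank].T  # (head_dim, rank)
        self.P_v = Vv[:rank].T  # (head_dim, rank)

    def compress(self, k: torch.Tensor, v: torch.Tensor):
        # k, v: (batch, seq_len, head_dim) -> (batch, seq_len, rank)
        return k @ self.P_k, v @ self.P_v

    def decompress(self, k_c: torch.Tensor, v_c: torch.Tensor):
        # Approximate reconstruction back to the full feature dimension;
        # every token stays in the cache, so no context is discarded.
        return k_c @ self.P_k.T, v_c @ self.P_v.T


# Example: keeping 48 of 128 feature dimensions shrinks the cache by ~62%
# along the feature axis (numbers chosen only to mirror the >60% claim).
W_k, W_v = torch.randn(1024, 128), torch.randn(1024, 128)
comp = FeatureCompressor(W_k, W_v, rank=48)
k, v = torch.randn(2, 4096, 128), torch.randn(2, 4096, 128)
k_c, v_c = comp.compress(k, v)
k_hat, v_hat = comp.decompress(k_c, v_c)
```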
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Efficiency in Model Inference
Contribution Types: NLP engineering experiment, Approaches for low compute settings-efficiency
Languages Studied: English, Chinese
Submission Number: 5284