Abstract: KV Cache optimization is crucial for improving the inference efficiency and length extrapolation of Transformer-based Large Language Models (LLMs). Previous KV Cache optimization approaches often prune or compress along the sequence dimension, leading to an irreversible loss of contextual information. In this work, we propose LightCache, a novel KV Cache optimization approach that operates on the feature dimension. LightCache employs parameter-aware compression and full-context cache selection, allowing it to reduce memory usage and enhance computational efficiency without sacrificing contextual information. Importantly, LightCache integrates with LLMs in a training-free manner. Experiments demonstrate that LightCache outperforms classic extrapolation, quantization, and context-pruning methods in long-context evaluation. In terms of efficiency, LightCache reduces the KV Cache size by over 60\% and achieves 1.7$\sim$2.4$\times$ higher memory efficiency as well as a 1.5$\sim$3.6$\times$ speedup at a 32K context length.
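For intuition, the sketch below illustrates one way feature-dimension KV Cache compression could work: project each cached key/value vector onto a low-rank basis derived from the layer's own projection weights (a "parameter-aware" choice), shrinking the cache along the feature axis rather than dropping tokens. The abstract does not specify LightCache's actual mechanism, so the class, method names, and the low-rank-via-SVD construction here are illustrative assumptions, not the paper's API.

```python
# Hypothetical sketch of feature-dimension KV Cache compression.
# NOTE: not LightCache's actual algorithm; names and the SVD-based
# construction are assumptions for illustration only.
import torch


class FeatureCompressor:
    def __init__(self, W_k: torch.Tensor, W_v: torch.Tensor, rank: int):
        # Build projection bases from the key/value projection weights so the
        # compression depends on model parameters rather than data statistics.
        # W_k, W_v: (hidden_dim, head_dim)
        _, _, Vk = torch.linalg.svd(W_k, full_matrices=False)
        _, _, Vv = torch.linalg.svd(W_v, full_matrices=False)
        self.P_k = Vk[:rank].T  # (head_dim, rank)
        self.P_v = Vv[:rank].T  # (head_dim, rank)

    def compress(self, k: torch.Tensor, v: torch.Tensor):
        # k, v: (batch, seq_len, head_dim) -> (batch, seq_len, rank)
        return k @ self.P_k, v @ self.P_v

    def decompress(self, k_c: torch.Tensor, v_c: torch.Tensor):
        # Approximate reconstruction back to the full feature dimension;
        # every token stays in the cache, so no context is discarded.
        return k_c @ self.P_k.T, v_c @ self.P_v.T


# Example: keeping 48 of 128 feature dimensions shrinks the cache by ~62%
# along the feature axis (numbers chosen only to mirror the >60% claim).
W_k, W_v = torch.randn(1024, 128), torch.randn(1024, 128)
comp = FeatureCompressor(W_k, W_v, rank=48)
k, v = torch.randn(2, 4096, 128), torch.randn(2, 4096, 128)
k_c, v_c = comp.compress(k, v)
k_hat, v_hat = comp.decompress(k_c, v_c)
```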
Paper Type: Long
Research Area: Special Theme (conference specific)
Research Area Keywords: Efficiency in Model Inference
Contribution Types: NLP engineering experiment, Approaches for low compute settings-efficiency
Languages Studied: English, Chinese
Submission Number: 5284