Keywords: Large Language Models, KV Cache Compression, Memory Efficiency
TL;DR: We propose ZSMerge, a zero-shot KV cache compression method that achieves 82% memory reduction and 3x throughput improvement for long-context LLMs without performance degradation or model retraining.
Abstract: The linear growth of key-value (KV) cache memory and the quadratic computational complexity of attention mechanisms pose significant bottlenecks for large language models (LLMs) in long-context processing. While existing KV cache optimization methods address these challenges through token pruning or feature merging, they often incur irreversible information loss or require costly retraining. To this end, we propose ZSMerge, a dynamic KV cache compression framework designed for efficient cache management, featuring three key operations: (1) fine-grained memory allocation guided by multi-dimensional token importance metrics at head-level granularity, (2) a residual merging mechanism that preserves critical context through compensated attention scoring, and (3) a zero-shot adaptation mechanism compatible with diverse LLM architectures without requiring retraining. ZSMerge significantly enhances memory efficiency and inference speed. When applied to LLaMA2-7B, it demonstrates a 20:1 compression ratio for key-value cache retention (reducing the memory footprint to 5% of baseline) while sustaining generation quality and achieving a 2.25× throughput improvement at extreme 54k-token contexts, eliminating out-of-memory failures. The code is available at https://anonymous.4open.science/r/ZSMerge-FC36.
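The two core cache operations the abstract names, importance-guided slot allocation and residual merging with compensated attention scoring, can be illustrated with a minimal single-head sketch. This is our own simplified reconstruction, not the paper's implementation: the importance metric (accumulated attention mass), the merge weighting, and all function names here are assumptions made for illustration.

```python
import numpy as np

def zsmerge_sketch(keys, values, attn_mass, budget):
    """Illustrative KV cache compression for one attention head.

    keys, values: (T, d) cached key/value vectors
    attn_mass:    (T,) accumulated attention each cached token received
                  (one possible multi-dimensional importance metric,
                  simplified here to a single score; an assumption)
    budget:       number of cache slots to retain (1 < budget < T)
    """
    # 1) Importance-guided allocation: keep the (budget - 1) most
    #    important tokens, reserve one slot for the merged residual.
    order = np.argsort(attn_mass)[::-1]
    keep = np.sort(order[: budget - 1])
    evict = np.sort(order[budget - 1 :])

    # 2) Residual merging: instead of dropping evicted tokens outright,
    #    fold them into a single slot weighted by their attention mass.
    w = attn_mass[evict]
    w = w / w.sum() if w.sum() > 0 else np.full(len(evict), 1.0 / len(evict))
    merged_k = (w[:, None] * keys[evict]).sum(axis=0)
    merged_v = (w[:, None] * values[evict]).sum(axis=0)

    new_keys = np.vstack([keys[keep], merged_k[None]])
    new_values = np.vstack([values[keep], merged_v[None]])

    # 3) Compensation factor per slot: the merged slot stands in for
    #    len(evict) tokens, so its attention score can be up-weighted
    #    accordingly at decode time (a simple stand-in for the paper's
    #    compensated attention scoring).
    comp = np.append(np.ones(len(keep)), float(len(evict)))
    return new_keys, new_values, comp
```

Because the procedure needs no gradients or learned parameters, it can be applied at inference time to any decoder's cache, which is the sense in which such a scheme is zero-shot. A 20:1 ratio as reported for LLaMA2-7B would correspond to `budget = T // 20`.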
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15503