Keywords: KV Cache; Large Language Model (LLM); Reinforcement Learning; Inter-layer Compression
Abstract: Inference in large language models typically relies on autoregressive generation. Caching intermediate key-value (KV) pairs eliminates redundant computation, yet the substantial memory overhead of the multi-layer KV cache introduces a new bottleneck. Existing compression approaches operate either within or across layers, suffering respectively from limited optimization flexibility and static configuration strategies. In this paper, we adopt Rényi entropy to characterize the information distribution in cached KV pairs, revealing significant and irregular fluctuations across both inputs and layers. Motivated by this observation, we propose BanditKV, a dynamic inter-layer compression framework built on a two-phase optimization mechanism. First, a contextual bandit-based policy adaptively selects the optimal layer-grouping configuration for each input. Second, Rényi entropy guides a non-uniform, layer-specific memory allocation scheme within each group. In addition, we introduce a lightweight randomized SVD that compresses factor matrices derived from KV tensors, rather than the original tensors, to further improve the compression ratio. Extensive experiments show that BanditKV achieves up to a 16x compression ratio and a 2.2x speedup with nearly zero loss in inference quality.
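The two building blocks named in the abstract, Rényi entropy over a spectrum and a lightweight randomized SVD, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names `renyi_entropy` and `randomized_svd`, the choice of applying the entropy to a normalized singular-value spectrum, and all shapes and parameters are assumptions for demonstration.

```python
import numpy as np

def renyi_entropy(p, alpha=2.0):
    """Rényi entropy of order `alpha` for a probability vector `p` (assumed usage)."""
    p = p[p > 0]
    if alpha == 1.0:
        return -np.sum(p * np.log(p))  # Shannon entropy as the alpha -> 1 limit
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def randomized_svd(A, rank, oversample=5, seed=0):
    """Halko-style randomized SVD: random range finder + small exact SVD."""
    rng = np.random.default_rng(seed)
    # Sketch the column space of A with a Gaussian test matrix
    Omega = rng.standard_normal((A.shape[1], rank + oversample))
    Q, _ = np.linalg.qr(A @ Omega)
    # Exact SVD of the much smaller projected matrix
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :rank], s[:rank], Vt[:rank]

# Toy example: entropy of a random "KV matrix" singular-value spectrum
kv = np.random.default_rng(1).standard_normal((128, 64))
s = np.linalg.svd(kv, compute_uv=False)
p = s / s.sum()                     # normalize spectrum to a distribution
h = renyi_entropy(p, alpha=2.0)     # higher entropy = flatter spectrum
```

A flatter spectrum (higher entropy) suggests the layer carries less compressible structure and may warrant a larger memory budget; the truncated factors from `randomized_svd` stand in for the cached tensor at a fraction of its size.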
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM Efficiency
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 4629