Keywords: KV Cache; Large Language Model (LLM); Reinforcement Learning; Inter-layer Compression
Abstract: Inference in large language models typically relies on autoregressive generation. Caching intermediate key-value (KV) pairs eliminates redundant computation, yet the substantial memory overhead of the multi-layer KV cache introduces a new bottleneck. Existing compression approaches operate either within or across layers, suffering respectively from limited optimization flexibility and static configuration strategies. In this paper, we adopt Rényi entropy to characterize the information distribution in cached KV pairs, revealing significant and irregular fluctuations across both inputs and layers. Motivated by this observation, we propose BanditKV, a dynamic inter-layer compression framework built on a two-phase optimization mechanism. First, a contextual bandit-based policy adaptively selects the optimal layer-grouping configuration for each input. Second, Rényi entropy guides a non-uniform, layer-specific memory allocation scheme within each group. In addition, we introduce a lightweight randomized SVD that compresses factor matrices derived from KV tensors, rather than the original tensors, to further improve the compression ratio. Extensive experiments show that BanditKV achieves up to a 16x compression ratio and a 2.2x speedup with nearly zero loss in inference quality.
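The two building blocks named in the abstract, Rényi entropy over a spectrum and a lightweight randomized SVD, can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names `renyi_entropy` and `randomized_svd`, the choice of applying the entropy to a normalized singular-value spectrum, and all shapes and parameters are assumptions for demonstration.

```python
import numpy as np

def renyi_entropy(p, alpha=2.0):
    """Rényi entropy of order `alpha` for a probability vector `p` (assumed usage)."""
    p = p[p > 0]
    if alpha == 1.0:
        return -np.sum(p * np.log(p))  # Shannon entropy as the alpha -> 1 limit
    return np.log(np.sum(p ** alpha)) / (1.0 - alpha)

def randomized_svd(A, rank, oversample=5, seed=0):
    """Halko-style randomized SVD: random range finder + small exact SVD."""
    rng = np.random.default_rng(seed)
    # Sketch the column space of A with a Gaussian test matrix
    Omega = rng.standard_normal((A.shape[1], rank + oversample))
    Q, _ = np.linalg.qr(A @ Omega)
    # Exact SVD of the much smaller projected matrix
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return (Q @ U_small)[:, :rank], s[:rank], Vt[:rank]

# Toy example: entropy of a random "KV matrix" singular-value spectrum
kv = np.random.default_rng(1).standard_normal((128, 64))
s = np.linalg.svd(kv, compute_uv=False)
p = s / s.sum()                     # normalize spectrum to a distribution
h = renyi_entropy(p, alpha=2.0)     # higher entropy = flatter spectrum
```

A flatter spectrum (higher entropy) suggests the layer carries less compressible structure and may warrant a larger memory budget; the truncated factors from `randomized_svd` stand in for the cached tensor at a fraction of its size.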
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM Efficiency
Contribution Types: Approaches to low-resource settings, Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 4629