Deep Low-Rank Projector for KV Cache Compression

ICLR 2026 Conference Submission 17276 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Large Language Model, KV Cache Compression, Deep Linear Network
Abstract: Large Language Models (LLMs) have become integral to a wide range of natural language processing tasks. A key component enabling fast autoregressive inference in LLMs is the Key-Value (KV) cache, which stores hidden states across decoding steps. However, the KV cache imposes substantial memory overhead, especially in long-context generation. While recent studies have proposed various compression techniques to mitigate this issue, they largely overlook the interaction between these techniques and Parameter-Efficient Fine-Tuning (PEFT) methods such as LoRA, under which models are significantly more sensitive to KV cache compression. To address this issue, we propose the Deep Low-Rank Projector (DLRP), a novel adapter that compresses the KV cache along the head dimension while preserving downstream performance in PEFT-adapted models. We introduce the Deep Linear Projector (DLP), realized as a Deep Linear Network (DLN). We also propose a novel regularizer that approximates the nuclear norm of the DLP, thereby promoting low-rank structure in the learned projection. After training with the proposed regularizer, we inspect the singular-value spectrum and select the minimum rank satisfying a predefined energy threshold, yielding a compact head dimension that balances compression and accuracy. Based on this rank, we construct the DLRP, fine-tune it on the target task, and merge its factorized layers into a single linear operator for efficient inference. Empirical evaluation confirms that DLRP achieves substantial KV cache compression while maintaining strong performance across diverse LLM benchmarks, offering a practical solution for deploying PEFT-adapted models in memory-constrained settings.
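
The following is a minimal PyTorch sketch, not the authors' implementation, of the rank-selection and merging steps the abstract describes: the deep linear projector's factor matrices are collapsed into a single matrix, its singular-value spectrum is inspected, and the smallest rank meeting an energy threshold is retained. All function names, the factor-product layout, and the square-root split of singular values are illustrative assumptions.

```python
import torch

def merge_deep_linear(factors):
    """Collapse a Deep Linear Network (list of weight matrices) into one matrix."""
    W = factors[0]
    for F in factors[1:]:
        W = F @ W  # later layers compose on the left
    return W

def select_rank(W, energy=0.99):
    """Smallest rank whose leading singular values retain `energy` of the total spectral energy."""
    s = torch.linalg.svdvals(W)
    cum = torch.cumsum(s**2, dim=0) / torch.sum(s**2)
    return int(torch.searchsorted(cum, torch.tensor(energy)).item()) + 1

def build_low_rank_projector(factors, energy=0.99):
    """Merge the DLP factors, pick a rank by the energy criterion, and return
    down/up projection matrices of that rank (illustrative factorization)."""
    W = merge_deep_linear(factors)
    r = select_rank(W, energy)
    U, s, Vh = torch.linalg.svd(W, full_matrices=False)
    down = torch.diag(s[:r].sqrt()) @ Vh[:r]   # (r, d_in): compresses the cached head dimension
    up = U[:, :r] @ torch.diag(s[:r].sqrt())   # (d_out, r): restores it at attention time
    return down, up
```

In such a scheme, only the r-dimensional projections would be stored in the KV cache, and after fine-tuning the factorized projector can be folded back into a single weight matrix, as in `merge_deep_linear`, for efficient inference.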
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 17276