Make Your LVLM KV Cache More Lightweight

Published: 01 May 2026, Last Modified: 01 May 2026 | Accepted by TMLR | CC BY 4.0
Abstract: The Key-Value (KV) cache has become a de facto component of modern Large Vision-Language Models (LVLMs) for inference. While it enhances decoding efficiency in Large Language Models (LLMs), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage. To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings. Guided by text prompts, LightKV employs cross-modality message passing to aggregate informative messages across vision tokens and progressively compress them during prefill. This prompt-aware guidance distinguishes our method from prior vision-only compression strategies. We evaluate LightKV on eight open-source LVLMs across eight public benchmark datasets, including MME and SeedBench. Experimental results demonstrate that with only 55% of the original vision tokens, LightKV (a) halves the vision-token KV cache size, (b) reduces computation by up to 40%, and (c) preserves general-purpose performance while significantly outperforming existing baselines.
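To give a rough sense of the memory at stake, the following is a minimal back-of-the-envelope sketch of vision-token KV cache size. It is not taken from the paper or the LightKV code: the function name `kv_cache_bytes`, the 7B-class decoder configuration (32 layers, 32 KV heads, head dimension 128, fp16), and the count of 576 vision tokens are all illustrative assumptions, and the uniform per-layer halving is a simplification rather than LightKV's actual compression schedule.

```python
def kv_cache_bytes(tokens_per_layer, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values, given the number of
    tokens retained in each decoder layer (one entry per layer)."""
    # Factor 2: each cached token stores one key and one value vector.
    per_token = 2 * num_kv_heads * head_dim * bytes_per_elem
    return per_token * sum(tokens_per_layer)


# Hypothetical configuration (assumed for illustration, not from the paper).
NUM_LAYERS = 32
NUM_KV_HEADS = 32
HEAD_DIM = 128
VISION_TOKENS = 576

# Baseline: every layer caches all vision tokens.
full = kv_cache_bytes([VISION_TOKENS] * NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)

# Simplified compressed case: every layer caches half of the vision tokens.
half = kv_cache_bytes([VISION_TOKENS // 2] * NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM)

full_mib = full / 2**20
half_mib = half / 2**20
print(f"full: {full_mib:.0f} MiB, halved: {half_mib:.0f} MiB")
```

Because the function takes a per-layer token count, it can also be applied to a progressive schedule in which later layers retain fewer tokens than earlier ones; the cache size is then proportional to the total number of retained tokens summed over layers.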
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We thank the AE and all reviewers for their constructive feedback, which has helped improve the quality and clarity of our paper. In the final revision, we have made the following updates:

- **Section 3.2:**
  - Expanded the description of the bipartite graph construction.
  - Clarified the construction of the candidate set $\mathcal{T}_\rho$, including the possibility of many-to-one aggregation.
  - Improved notational consistency and clarity.
  - Provided clearer definitions of the scheduling hyperparameters.
- **Section 3.3:**
  - Revised the complexity analysis to ensure consistency with previously defined notation.
- **Section 4.3:**
  - Added experiments comparing our prompt-guided weighting strategy to uniform and random variants (Table 4).
  - Added experiments comparing the proposed hierarchical strategy with global-only and local-only variants (Table 7).
- **Appendix A.1:**
  - Included a summary table of notations used throughout the paper (Table 9).
- **Appendix A.3:**
  - Added comparisons between bipartite and full pairwise matching, along with analysis of the performance-efficiency trade-off.
  - Included additional experiments demonstrating robustness to scheduling parameters (Table 15).
  - Added comparisons with tuned FastV baselines (Table 18).
Code: https://github.com/howtoosee/LightKV
Assigned Action Editor: ~Chenyu_You1
Submission Number: 7409