Keywords: LVLM, efficiency, KV cache
Abstract: The Key-Value (**KV**) cache has become a _de facto_ component of inference in modern Large Vision-Language Models (**LVLMs**).
While it enhances decoding efficiency in Large Language Models (**LLMs**), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage.
To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings.
Guided by text prompts, LightKV employs cross-modality message passing to aggregate prompt-relevant information across vision tokens and progressively compress them during prefill.
This prompt-aware guidance distinguishes our method from prior vision-only compression strategies.
We evaluate LightKV on eight open-source LVLMs across eight public benchmarks, including MME and SeedBench.
Experimental results demonstrate that with only 50% of the original vision tokens, LightKV (i) halves KV cache size, (ii) reduces computation by up to 40%, and (iii) preserves general-purpose performance while significantly outperforming existing baselines.
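The abstract does not specify the compression operator, but the prompt-guided selection idea it describes can be illustrated with a minimal sketch. The function name `prune_vision_kv`, the tensor shapes, and the single-shot top-k selection (a simplification standing in for LightKV's progressive, message-passing-based compression) are assumptions for illustration, not the authors' implementation.

```python
import torch

def prune_vision_kv(vision_keys, vision_values, prompt_queries, keep_ratio=0.5):
    """Keep the vision-token KV pairs most attended to by text-prompt queries.

    vision_keys / vision_values: (num_vision_tokens, head_dim)
    prompt_queries:              (num_prompt_tokens, head_dim)
    """
    d = vision_keys.shape[-1]
    # Cross-modality relevance: average attention weight each vision token
    # receives from the text-prompt queries.
    attn = torch.softmax(prompt_queries @ vision_keys.T / d ** 0.5, dim=-1)
    relevance = attn.mean(dim=0)  # (num_vision_tokens,)

    # Retain the top keep_ratio fraction of vision tokens (e.g. 50%),
    # preserving their original order in the sequence.
    k = max(1, int(keep_ratio * vision_keys.shape[0]))
    keep_idx = relevance.topk(k).indices.sort().values
    return vision_keys[keep_idx], vision_values[keep_idx]

# Toy usage: 576 vision tokens, 32 prompt tokens, head_dim 64 (hypothetical sizes).
vk, vv = torch.randn(576, 64), torch.randn(576, 64)
pq = torch.randn(32, 64)
vk_small, vv_small = prune_vision_kv(vk, vv, pq)
print(vk_small.shape)  # torch.Size([288, 64]) -> 50% of the vision KV entries
```

Keeping only half of the vision-token KV entries directly halves that part of the cache; the actual method additionally compresses tokens progressively across prefill layers rather than pruning once.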
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12077