Keywords: LVLM, efficiency, KV cache
Abstract: The Key-Value (**KV**) cache has become a _de facto_ component of inference in modern Large Vision-Language Models (**LVLMs**).
While it enhances decoding efficiency in Large Language Models (**LLMs**), its direct adoption in LVLMs introduces substantial GPU memory overhead due to the large number of vision tokens processed during the prefill stage.
To tackle this problem, we propose LightKV, a novel approach that reduces KV cache size by exploiting the redundancy among vision-token embeddings.
Guided by text prompts, LightKV employs cross-modality message passing to aggregate prompt-relevant information across vision tokens and progressively compress them during prefill.
This prompt-aware guidance distinguishes our method from prior vision-only compression strategies.
We evaluate LightKV on eight open-source LVLMs across eight public benchmarks, including MME and SeedBench.
Experimental results demonstrate that with only 50% of the original vision tokens, LightKV (i) halves KV cache size, (ii) reduces computation by up to 40%, and (iii) preserves general-purpose performance while significantly outperforming existing baselines.
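The abstract does not specify the compression operator, but the prompt-guided selection idea it describes can be illustrated with a minimal sketch. The function name `prune_vision_kv`, the tensor shapes, and the single-shot top-k selection (a simplification standing in for LightKV's progressive, message-passing-based compression) are assumptions for illustration, not the authors' implementation.

```python
import torch

def prune_vision_kv(vision_keys, vision_values, prompt_queries, keep_ratio=0.5):
    """Keep the vision-token KV pairs most attended to by text-prompt queries.

    vision_keys / vision_values: (num_vision_tokens, head_dim)
    prompt_queries:              (num_prompt_tokens, head_dim)
    """
    d = vision_keys.shape[-1]
    # Cross-modality relevance: average attention weight each vision token
    # receives from the text-prompt queries.
    attn = torch.softmax(prompt_queries @ vision_keys.T / d ** 0.5, dim=-1)
    relevance = attn.mean(dim=0)  # (num_vision_tokens,)

    # Retain the top keep_ratio fraction of vision tokens (e.g. 50%),
    # preserving their original order in the sequence.
    k = max(1, int(keep_ratio * vision_keys.shape[0]))
    keep_idx = relevance.topk(k).indices.sort().values
    return vision_keys[keep_idx], vision_values[keep_idx]

# Toy usage: 576 vision tokens, 32 prompt tokens, head_dim 64 (hypothetical sizes).
vk, vv = torch.randn(576, 64), torch.randn(576, 64)
pq = torch.randn(32, 64)
vk_small, vv_small = prune_vision_kv(vk, vv, pq)
print(vk_small.shape)  # torch.Size([288, 64]) -> 50% of the vision KV entries
```

Keeping only half of the vision-token KV entries directly halves that part of the cache; the actual method additionally compresses tokens progressively across prefill layers rather than pruning once.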
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12077