Keywords: KV cache compression, inference efficiency, multimodal large language models
Abstract: Key-Value (KV) caching is essential for efficient inference in multimodal large language models (MLLMs), yet its memory footprint grows linearly with context length and becomes a major bottleneck due to the large number of visual tokens.
Recent prefill-only KV selection methods estimate KV importance from prefilling statistics, implicitly assuming that prefilling-time queries are representative of those encountered during decoding.
We show that this assumption breaks down in multimodal inference, where decoding-time queries exhibit substantially larger variance than prefilling-stage representations, leading to unstable KV importance estimation under tight cache budgets.
As a result, small ranking errors can disproportionately discard semantically critical visual tokens and degrade grounding and reasoning performance.
We propose MM-ShiftKV, a training-free and strictly prefill-only KV selection method that is explicitly decode-aware.
MM-ShiftKV approximates decoding-time query behavior during prefilling by constructing variance-expanded query proxies and estimates prompt KV importance from their aggregated attention mass.
Experiments on multimodal benchmarks demonstrate that MM-ShiftKV consistently outperforms existing KV selection methods under strict KV-cache budgets.
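To make the abstract's idea concrete, here is a minimal, hypothetical sketch of prefill-only KV selection with variance-expanded query proxies. The function name, the Gaussian proxy construction, and the `alpha` expansion factor are illustrative assumptions, not the paper's actual formulation: proxies are sampled around the prefill query statistics with inflated variance (to mimic the larger decode-time query variance), attention mass over prompt keys is aggregated across proxies, and only the top-budget KV positions are kept.

```python
import numpy as np

def select_kv_budget(Q, K, budget, n_proxies=8, alpha=1.0, seed=0):
    """Illustrative sketch (not the paper's exact method): score prompt KV
    entries with variance-expanded query proxies, keep top-`budget` positions.

    Q: (T, d) prefill-time queries for one head; K: (T, d) prompt keys.
    """
    rng = np.random.default_rng(seed)
    mu = Q.mean(axis=0)       # per-dimension mean of prefill queries
    sigma = Q.std(axis=0)     # per-dimension std of prefill queries
    # Hypothetical proxy construction: sample around the prefill query
    # distribution with variance expanded by (1 + alpha) to approximate
    # the higher variance of decode-time queries.
    proxies = mu + (1.0 + alpha) * sigma * rng.standard_normal((n_proxies, mu.size))
    scores = proxies @ K.T / np.sqrt(K.shape[1])   # (n_proxies, T) scaled logits
    scores -= scores.max(axis=1, keepdims=True)    # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    importance = attn.sum(axis=0)                  # aggregated attention mass
    # Keep the `budget` highest-scoring KV positions, in original order.
    return np.sort(np.argsort(importance)[-budget:])
```

In practice such a score would be computed per head during prefilling, with non-selected KV entries evicted before decoding begins.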
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM efficiency, inference efficiency, memory-efficient inference, multimodal inference
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Approaches low compute settings-efficiency, Publicly available software and/or pre-trained models
Languages Studied: English
Submission Number: 9842