Zero in on Faithful Anchors: High-Fidelity Visual Token Condensation for Multimodal Large Language Models

03 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Token Pruning, MLLMs, Efficient Inference
Abstract: Multimodal Large Language Models (MLLMs) have demonstrated impressive visual reasoning capabilities, but their scalability is limited by the computational burden of processing massive numbers of visual tokens. To alleviate this bottleneck, many studies have explored visual token pruning strategies, which utilize cross-attention or [\texttt{CLS}] attention to identify and retain informative visual tokens. In this work, we uncover a critical limitation of such pruning approaches, \textit{i.e.}, they tend to either omit or over-attend to the background context within images, resulting in potential semantic distortion. To solve this problem, we introduce CondenseVLM, a dynamic token compression framework for high-fidelity and efficient MLLM inference that enhances the information density of retained visual tokens. In particular, CondenseVLM employs a three-stage method: it first selects high-attention tokens as faithful anchors to preserve fine-grained semantics, then compensates for lost context by recovering important background tokens, and finally merges the remaining tokens into the retained ones based on spatial proximity and semantic similarity to ensure view integrity. This synergistic optimization of semantic uniqueness, spatial coverage, and contextual integrity makes CondenseVLM capable of high-fidelity compression. Extensive experiments demonstrate that CondenseVLM can prune up to 88.9\% of visual tokens with merely a \underline{3\%} performance drop, and 77.8\% with just a \underline{1.2\%} drop. Moreover, it integrates seamlessly with efficient attention operators during decoding, delivering substantial speedups and memory savings. The code will be released.
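The three-stage pipeline outlined in the abstract can be sketched roughly as follows. This is a minimal illustrative reconstruction, not the authors' released implementation: the function name `condense_tokens`, the parameters `keep_ratio` and `anchor_frac`, the uniform sampling used for background compensation, and the averaging-based merge are all assumptions made for the sketch.

```python
import numpy as np

def condense_tokens(tokens, attn, keep_ratio=0.25, anchor_frac=0.6):
    """Hypothetical sketch of the three stages described in the abstract.

    tokens: (n, d) visual token embeddings
    attn:   (n,) per-token attention scores (e.g. from [CLS] attention)
    """
    n, d = tokens.shape
    k = max(1, int(n * keep_ratio))          # tokens to retain overall
    n_anchor = max(1, int(k * anchor_frac))  # budget for faithful anchors

    # Stage 1: select the highest-attention tokens as faithful anchors,
    # preserving fine-grained semantics.
    order = np.argsort(-attn)
    anchors = order[:n_anchor]

    # Stage 2: compensate with background tokens. Here we simply take a
    # uniform stride over the remaining (lower-attention) tokens as a
    # stand-in for the paper's spatial-coverage criterion.
    rest = order[n_anchor:]
    n_bg = k - n_anchor
    step = max(1, len(rest) // n_bg) if n_bg > 0 else len(rest)
    background = rest[::step][:n_bg]
    kept = np.concatenate([anchors, background])

    # Stage 3: merge each pruned token into its most similar retained
    # token (cosine similarity), so pruned content is folded in rather
    # than discarded.
    pruned = np.setdiff1d(np.arange(n), kept)
    out = tokens[kept].copy()
    if len(pruned):
        tn = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-8)
        sim = tn[pruned] @ tn[kept].T        # (n_pruned, k)
        assign = sim.argmax(axis=1)
        for j in range(len(kept)):
            grp = pruned[assign == j]
            if len(grp):
                out[j] = (out[j] + tokens[grp].mean(axis=0)) / 2
    return out, kept
```

With `keep_ratio=0.25`, a 36-token input is condensed to 9 output tokens, which would correspond to pruning 75% of the visual tokens in the abstract's accounting.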
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1315