Prune Redundancy, Preserve Essence: Vision Token Compression in VLMs via Synergistic Importance-Diversity
Keywords: efficient VLMs, visual token compression
TL;DR: Training-free visual token compression balancing information importance and diversity for efficient VLMs.
Abstract: Vision-language models (VLMs) suffer significant computational inefficiency from the excessive number of visual tokens they must process. While prior work shows that a large fraction of visual tokens are redundant, existing compression methods struggle to balance \textit{importance preservation} and \textit{information diversity}. To address this, we propose $\textbf{PruneSID}$, a training-free Synergistic Importance-Diversity approach featuring a two-stage pipeline: (1) Principal Semantic Component Analysis (PSCA), which clusters tokens into semantically coherent groups to ensure comprehensive concept coverage, and (2) intra-group Non-Maximum Suppression (NMS), which prunes redundant tokens while preserving key representative tokens within each group. Additionally, $\textbf{PruneSID}$ incorporates an information-aware dynamic compression ratio mechanism that adapts the token compression rate to image complexity, preserving more information on average across diverse scenes. Extensive experiments demonstrate state-of-the-art performance: $\textbf{PruneSID}$ retains $\textbf{96.3}$% of the original accuracy on LLaVA-1.5 while keeping only $\textbf{11.1}$% of the tokens, and $\textbf{92.8}$% under extreme compression ($\textbf{5.6}$% retention) on LLaVA-NeXT, outperforming prior methods by $\textbf{2.5}$% while prefilling $\textbf{7.8}\times$ faster than the original model. Our framework generalizes across diverse VLMs and to both image and video modalities, showcasing strong cross-modal versatility.
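To make the two-stage pipeline concrete, the sketch below is a minimal, illustrative reconstruction from the abstract alone, not the authors' implementation: cosine k-means stands in for PSCA, importance scores are assumed to be supplied externally (e.g., from attention weights), and the function names, thresholds, and the entropy-based complexity proxy used for the dynamic ratio are all hypothetical.

```python
import torch
import torch.nn.functional as F

def dynamic_keep_ratio(feats, lo=0.05, hi=0.25):
    """Hypothetical information-aware ratio: keep more tokens for complex
    images, using normalized eigenvalue entropy of the token feature
    covariance as a stand-in complexity measure."""
    cov = feats.T @ feats / feats.shape[0]
    eig = torch.linalg.eigvalsh(cov).clamp(min=1e-8)
    p = eig / eig.sum()
    entropy = -(p * p.log()).sum() / torch.log(torch.tensor(float(len(p))))
    return lo + (hi - lo) * float(entropy)

def compress_tokens(tokens, importance, num_groups=8, keep_per_group=8,
                    sim_threshold=0.8, iters=10):
    """tokens: (N, D) visual token features; importance: (N,) scores
    (assumed given). Returns indices of the retained tokens."""
    N, _ = tokens.shape
    feats = F.normalize(tokens, dim=-1)

    # Stage 1: cluster tokens into semantically coherent groups.
    # Cosine k-means is an illustrative stand-in for the paper's PSCA.
    centers = feats[torch.randperm(N)[:num_groups]].clone()
    for _ in range(iters):
        assign = (feats @ centers.T).argmax(dim=-1)
        for g in range(num_groups):
            members = feats[assign == g]
            if len(members) > 0:
                centers[g] = F.normalize(members.mean(dim=0), dim=0)

    # Stage 2: intra-group NMS -- greedily keep the most important token in
    # each group and suppress near-duplicate (highly similar) neighbors.
    kept = []
    for g in range(num_groups):
        idx = (assign == g).nonzero(as_tuple=True)[0]
        idx = idx[importance[idx].argsort(descending=True)]
        selected = []
        for i in idx.tolist():
            if len(selected) >= keep_per_group:
                break
            if all(float(feats[i] @ feats[j]) < sim_threshold for j in selected):
                selected.append(i)
        kept.extend(selected)
    return torch.tensor(sorted(kept))

# Usage on LLaVA-1.5-style inputs: 576 visual tokens, importance assumed given.
tokens = torch.randn(576, 1024)
importance = torch.rand(576)
ratio = dynamic_keep_ratio(F.normalize(tokens, dim=-1))
keep = compress_tokens(tokens, importance,
                       keep_per_group=max(1, round(ratio * 576 / 8)))
print(f"kept {len(keep)} of 576 tokens ({len(keep) / 576:.1%})")
```

In this sketch the clustering guarantees diversity (every semantic group contributes representatives) while the per-group NMS enforces importance, so the overall retention roughly tracks the dynamic ratio; with the default budget of 8 per group across 8 groups, at most 64 of 576 tokens survive, matching the 11.1% regime quoted in the abstract.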
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12764