Keywords: Efficient Inference, MLLMs, Token Pruning.
Abstract: Although Multimodal Large Language Models (MLLMs) excel in visual-language understanding, the quadratic complexity induced by massive visual tokens causes significant computational overhead. Existing visual token pruning strategies often rely on single-dimensional metrics, failing either to balance image-intrinsic global context with text-guided relevance or to effectively eliminate feature redundancy. To address this, we propose IFD-Prune, a training-free, plug-and-play visual token pruning framework. Specifically, we design a dual-criteria importance mechanism that explicitly fuses intrinsic visual saliency and cross-modal text relevance. Furthermore, we formulate visual token pruning as a maximum volumetric information problem, utilizing iterative greedy orthogonal projection to select tokens that span the largest effective hypervolume in the feature space. Extensive experiments demonstrate that IFD-Prune outperforms state-of-the-art methods. Notably, on LLaVA-1.5-7B, our method reduces visual tokens by 88.9% and FLOPs by 63.8% while robustly retaining 96.87% of the original performance, achieving a superior efficiency-accuracy trade-off.
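The "iterative greedy orthogonal projection" step described in the abstract can be illustrated with a minimal NumPy sketch. This is a hypothetical reconstruction, not the authors' code: at each round it keeps the token whose residual, after projecting out the span of already-selected tokens, has the largest norm, which greedily grows the hypervolume spanned by the retained feature vectors (the function name `greedy_volume_select` and all details are assumptions for illustration).

```python
import numpy as np

def greedy_volume_select(tokens: np.ndarray, k: int) -> list[int]:
    """Greedily pick up to k row indices of `tokens` (n x d) that span a
    large hypervolume: at each step, select the token with the largest
    residual norm, then project that direction out of all residuals
    (Gram-Schmidt style). Illustrative sketch, not the paper's method."""
    n, _ = tokens.shape
    residuals = tokens.astype(float).copy()
    selected: list[int] = []
    for _ in range(min(k, n)):
        norms = np.linalg.norm(residuals, axis=1)
        norms[selected] = -1.0  # exclude already-chosen tokens
        i = int(np.argmax(norms))
        if norms[i] <= 1e-12:
            break  # remaining tokens lie in the selected span; stop early
        selected.append(i)
        u = residuals[i] / norms[i]            # new orthonormal direction
        residuals -= np.outer(residuals @ u, u)  # remove that component
    return selected
```

On orthonormal inputs every token adds volume, so k distinct indices are returned; duplicate (linearly dependent) tokens contribute zero residual and are skipped, which is the redundancy-elimination behavior the abstract alludes to.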
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM Efficiency; Pruning
Contribution Types: Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 8799