Keywords: Efficient Inference, MLLMs, Token Pruning.
Abstract: Although Multimodal Large Language Models (MLLMs) excel in visual-language understanding, the quadratic complexity induced by massive visual tokens causes significant computational overhead. Existing visual token pruning strategies often rely on single-dimensional metrics, failing either to balance image-intrinsic global context with text-guided relevance or to effectively eliminate feature redundancy. To address this, we propose IFD-Prune, a training-free, plug-and-play visual token pruning framework. Specifically, we design a dual-criteria importance mechanism that explicitly fuses intrinsic visual saliency and cross-modal text relevance. Furthermore, we formulate visual token pruning as a maximum volumetric information problem, utilizing iterative greedy orthogonal projection to select tokens that span the largest effective hypervolume in the feature space. Extensive experiments demonstrate that IFD-Prune outperforms state-of-the-art methods. Notably, on LLaVA-1.5-7B, our method reduces visual tokens by 88.9% and FLOPs by 63.8% while robustly retaining 96.87% of the original performance, achieving a superior efficiency-accuracy trade-off.
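The "iterative greedy orthogonal projection" step described in the abstract can be illustrated with a minimal NumPy sketch. This is a hypothetical reconstruction, not the authors' code: at each round it keeps the token whose residual, after projecting out the span of already-selected tokens, has the largest norm, which greedily grows the hypervolume spanned by the retained feature vectors (the function name `greedy_volume_select` and all details are assumptions for illustration).

```python
import numpy as np

def greedy_volume_select(tokens: np.ndarray, k: int) -> list[int]:
    """Greedily pick up to k row indices of `tokens` (n x d) that span a
    large hypervolume: at each step, select the token with the largest
    residual norm, then project that direction out of all residuals
    (Gram-Schmidt style). Illustrative sketch, not the paper's method."""
    n, _ = tokens.shape
    residuals = tokens.astype(float).copy()
    selected: list[int] = []
    for _ in range(min(k, n)):
        norms = np.linalg.norm(residuals, axis=1)
        norms[selected] = -1.0  # exclude already-chosen tokens
        i = int(np.argmax(norms))
        if norms[i] <= 1e-12:
            break  # remaining tokens lie in the selected span; stop early
        selected.append(i)
        u = residuals[i] / norms[i]            # new orthonormal direction
        residuals -= np.outer(residuals @ u, u)  # remove that component
    return selected
```

On orthonormal inputs every token adds volume, so k distinct indices are returned; duplicate (linearly dependent) tokens contribute zero residual and are skipped, which is the redundancy-elimination behavior the abstract alludes to.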
Paper Type: Long
Research Area: LLM Efficiency
Research Area Keywords: LLM Efficiency; Pruning
Contribution Types: Approaches low compute settings-efficiency
Languages Studied: English
Submission Number: 8799