MI-Pruner: Crossmodal Mutual Information-guided Token Pruner for Efficient MLLMs

TMLR Paper9187 Authors

24 May 2026 (modified: 31 May 2026) | Under review for TMLR | CC BY 4.0
Abstract: In multimodal large language models (MLLMs), visual tokens are far more numerous than text tokens and inherently sparse in information. To enable efficient inference under controllable token budgets, training-free token pruning techniques have emerged, valued for their versatility and near-zero cost. Current methods typically measure token importance by attention salience in the visual encoder or the LLM decoder, preserving visual tokens with high attention scores and pruning the rest. However, attention salience is often skewed by sink tokens and positional bias. These salience-based methods also require extracting attention maps, which adds implementation complexity and memory overhead, and they inadequately account for the diversity of the selected tokens. In this paper, we pursue a sound and surgical approach, called MI-Pruner, which bypasses attention collection and instead estimates Mutual Information (MI)-based relevance in the projection space. This yields an explicit, information-theoretically motivated measure of feature-level dependency for identifying the most informative tokens. Because it relies on neither internal attention maps nor architectural modifications, MI-Pruner can be applied seamlessly to off-the-shelf MLLMs for inference acceleration. Extensive experiments on LLaVA1.5, the Qwen series, and Video-LLaVA demonstrate that our approach achieves a favorable performance-efficiency trade-off across diverse image and video understanding benchmarks.
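The abstract does not specify the MI estimator, so the following is only an illustrative sketch of training-free, relevance-guided pruning in a shared projection space, not the authors' method. It assumes a Gaussian proxy: for two jointly Gaussian scalars with correlation ρ, the mutual information is −½ log(1 − ρ²), and cosine similarity between a visual token and a text token is used as a stand-in for ρ. The function name `prune_visual_tokens`, the max-over-text aggregation, and the array shapes are all hypothetical choices made for illustration.

```python
import numpy as np

def prune_visual_tokens(visual, text, keep):
    """Keep the `keep` visual tokens with the highest proxy relevance to the text.

    visual: (N, d) array of visual tokens after the projector (assumed layout).
    text:   (M, d) array of text token embeddings in the same space.
    Returns (sorted indices of kept tokens, per-token scores).
    """
    eps = 1e-8
    # L2-normalise so that dot products are cosine similarities.
    v = visual / (np.linalg.norm(visual, axis=1, keepdims=True) + eps)
    t = text / (np.linalg.norm(text, axis=1, keepdims=True) + eps)
    # Cosine similarity as a stand-in for correlation (assumption);
    # clip away from +/-1 so the logarithm below stays finite.
    rho = np.clip(v @ t.T, -1 + 1e-6, 1 - 1e-6)        # shape (N, M)
    # Gaussian MI proxy: I = -0.5 * log(1 - rho^2), always >= 0.
    mi = -0.5 * np.log1p(-rho ** 2)
    # Score each visual token by its strongest dependency on any text token.
    scores = mi.max(axis=1)
    keep = min(keep, visual.shape[0])
    # Take the top-`keep` scores, then restore the original token order.
    kept = np.sort(np.argsort(-scores)[:keep])
    return kept, scores

# Toy usage with random features (shapes are illustrative only).
rng = np.random.default_rng(0)
visual = rng.normal(size=(576, 64))
text = rng.normal(size=(12, 64))
kept, scores = prune_visual_tokens(visual, text, keep=64)
pruned = visual[kept]                                   # shape (64, 64)
```

Because the score needs only the projected features, no attention maps are extracted, which is the property the abstract emphasises. This sketch does not address token diversity; a real estimator would likely differ.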
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Farzan_Farnia1
Submission Number: 9187