InfoPrune: Revisiting Visual Token Pruning from an Information-Theoretic Perspective

16 Sept 2025 (modified: 26 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Visual Token Pruning, Vision Token Pruning, MLLMs
Abstract: Multimodal large language models (MLLMs) rely on dense visual tokens, but their indiscriminate propagation causes severe inference overhead. Existing pruning strategies largely treat token importance as a static property (e.g., attention strength), overlooking the dynamic nature of evidence flow. In this work, we recast pruning as an information budgeting problem: under limited computation, which tokens provide genuine marginal information, and when has their contribution been fully injected into the language stream? Guided by this formulation, we propose InfoPrune, a training-free two-stage framework. Stage 1 refines visual token selection by combining attention priors with structure-aware incremental metrics, while Stage 2 detects mid-layer semantic convergence and performs one-shot pruning within the LLM. This design directly targets "who to keep" and "when to stop," reducing redundancy while preserving essential semantics. Experiments on LLaVA-1.5, LLaVA-Next, and Qwen-VL-2.5 show that InfoPrune achieves over 96% performance retention with only 11.1% of the tokens, outperforming prior methods in generality, stability, and efficiency. Our work provides both a principled perspective on multimodal evidence budgeting and a practical solution for efficient inference.
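To make the two-stage idea concrete, below is a minimal, hypothetical sketch in PyTorch. It is not the authors' implementation: the function names (`stage1_select`, `stage2_convergence_layer`), the specific scoring rule mixing an attention prior with a cosine-novelty term, and the convergence threshold `tau` are all illustrative assumptions standing in for the paper's "structure-aware incremental metrics" and "mid-layer semantic convergence" test.

```python
import torch
import torch.nn.functional as F


def stage1_select(vis_tokens, attn_scores, keep_ratio=0.111, alpha=0.5):
    """Greedy selection combining an attention prior with a marginal-novelty term.

    vis_tokens:  (N, D) visual token embeddings
    attn_scores: (N,)   attention-based importance prior (e.g., text-to-image attention)
    Returns indices of kept tokens. The scoring rule here is illustrative only.
    """
    n_keep = max(1, int(round(keep_ratio * vis_tokens.size(0))))
    normed = F.normalize(vis_tokens, dim=-1)
    selected = [int(attn_scores.argmax())]            # seed with the strongest prior
    remaining = set(range(vis_tokens.size(0))) - set(selected)

    while len(selected) < n_keep and remaining:
        sel = normed[selected]                        # (k, D) already-kept tokens
        rest = sorted(remaining)
        sims = normed[rest] @ sel.T                   # (|rest|, k) cosine similarities
        novelty = 1.0 - sims.max(dim=1).values        # proxy for marginal information
        score = alpha * attn_scores[rest] + (1 - alpha) * novelty
        best = rest[int(score.argmax())]
        selected.append(best)
        remaining.remove(best)
    return torch.tensor(sorted(selected))


def stage2_convergence_layer(layer_hiddens, vis_slice, tau=0.98):
    """Return the first LLM layer whose visual hidden states barely change
    (a crude 'semantic convergence' test); prune all visual tokens from there on.

    layer_hiddens: list of (T, D) hidden states, one per layer
    vis_slice:     slice covering the visual-token positions
    """
    for layer in range(1, len(layer_hiddens)):
        prev = layer_hiddens[layer - 1][vis_slice]
        curr = layer_hiddens[layer][vis_slice]
        if F.cosine_similarity(prev, curr, dim=-1).mean() >= tau:
            return layer                              # one-shot prune at this layer
    return len(layer_hiddens)                         # never converged: keep everything


# Toy usage with random tensors
if __name__ == "__main__":
    torch.manual_seed(0)
    vis = torch.randn(576, 1024)                      # e.g., LLaVA-1.5 visual token count
    attn = torch.rand(576)
    kept = stage1_select(vis, attn)                   # roughly 11% of tokens survive Stage 1
    hiddens = [torch.randn(640, 1024) for _ in range(32)]
    layer = stage2_convergence_layer(hiddens, slice(0, 576))
    print(f"kept {kept.numel()} visual tokens; prune remainder at layer {layer}")
```

The keep ratio of 0.111 mirrors the 11.1% token budget reported in the abstract; everything else (the greedy novelty criterion, the cosine-based convergence check) is a plausible stand-in for the paper's actual metrics.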
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 6680