Unbiased Token Pruning for Efficient Large Multimodal Models

16 Sept 2025 (modified: 14 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Large multimodal models, Visual token pruning
TL;DR: A visual token pruning method for large multimodal models
Abstract: Large Multimodal Models (LMMs) have demonstrated exceptional capabilities and drawn increasing attention. However, their substantial computational cost poses significant challenges for real-world applications. A considerable portion of this cost arises from the lengthy sequences of image tokens, which incur quadratically increasing computation due to the Transformer architecture. In light of this, recent works have explored visual token pruning for higher efficiency. Although effective, they generally suffer from inaccurate importance estimation (i.e., what to prune) and suboptimal pruning layers (i.e., where to prune). This leads to notable visual information loss and inferior performance. In this work, we present an Unbiased Token Pruning (UTP) method to tackle these issues. For what to prune, we introduce an Unbiased Relevance Estimation (URE) strategy, which disentangles the interference of position embedding for a more accurate importance assessment of visual tokens. For where to prune, we propose an Unbiased Token Retention (UTR) strategy, which derives the optimal pruning scheme by formulating the minimization of information loss as an integer linear programming problem. Extensive experiments demonstrate that our method outperforms existing state-of-the-art works and exhibits favorable performance across various tasks, showing its superiority for efficient inference of LMMs. Code will be publicly available.
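The two strategies can be illustrated with short sketches. For URE, one plausible reading of "disentangling the interference of position embedding" in RoPE-based LMMs is to score visual tokens by text-to-visual attention computed without the rotary rotation, so that content similarity is not confounded by relative distance. The sketch below shows that idea only; the function name, shapes, and the `apply_rope` hook are all hypothetical, and the paper's exact estimator may differ.

```python
import torch

def unbiased_relevance(q_text, k_visual, apply_rope=None):
    """Hedged sketch: score N visual tokens by a text query's attention.

    q_text:     (d,)   query vector of a text token (pre-position-embedding)
    k_visual:   (N, d) key vectors of the visual tokens (pre-position-embedding)
    apply_rope: optional callable; a standard (biased) variant would rotate
                q/k by position before scoring. Passing None skips the
                position embedding, giving a content-only relevance score.
    """
    if apply_rope is not None:              # biased baseline for comparison
        q_text, k_visual = apply_rope(q_text), apply_rope(k_visual)
    d = q_text.shape[-1]
    logits = k_visual @ q_text / d ** 0.5   # (N,) scaled dot-product scores
    return logits.softmax(dim=-1)           # relevance distribution over tokens
```

For UTR, the abstract states only that the pruning schedule is obtained by solving an integer linear program. A minimal sketch of one such formulation, assuming we choose how many tokens r_l to retain at each of L layers so as to maximize a given per-layer information value under a total token budget (a linear proxy for compute) and a non-increasing retention schedule: the loss model, variables, and constraints here are assumptions for illustration, not the paper's formulation.

```python
import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def solve_retention(info_value, n_tokens, budget):
    """Hypothetical ILP sketch: pick retained token counts r_l per layer.

    info_value[l]: assumed information value of keeping one token through
                   layer l (the paper derives its own loss measure).
    Maximizes sum_l info_value[l] * r_l subject to sum_l r_l <= budget
    and a monotone non-increasing schedule r_{l+1} <= r_l.
    """
    L = len(info_value)
    c = -np.asarray(info_value, dtype=float)        # milp minimizes, so negate
    budget_con = LinearConstraint(np.ones((1, L)), -np.inf, budget)
    A_mono = np.zeros((L - 1, L))                   # r_{l+1} - r_l <= 0
    for l in range(L - 1):
        A_mono[l, l], A_mono[l, l + 1] = -1.0, 1.0
    mono_con = LinearConstraint(A_mono, -np.inf, 0.0)
    res = milp(c=c,
               constraints=[budget_con, mono_con],
               integrality=np.ones(L),              # all r_l are integers
               bounds=Bounds(0, n_tokens))
    assert res.success, "ILP infeasible under the given budget"
    return res.x.round().astype(int)
```

Under these assumptions, a call like `solve_retention(np.linspace(1.0, 0.1, 32), n_tokens=576, budget=4000)` should front-load the budget, keeping more tokens in earlier layers where the assumed information value is higher.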
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7462