Keywords: Token Pruning
TL;DR: STP, a token-pruning framework that jointly optimizes token importance and semantic diversity for efficient inference in large vision-language models
Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success in multimodal reasoning by jointly processing visual and textual information. However, efficient inference in practical applications remains challenging due to the substantial computational and memory overhead of LVLMs. Existing token pruning strategies often face a trade-off: they either prioritize token importance while neglecting semantic diversity, or enforce diversity at the expense of critical tokens. To overcome this limitation, we propose STP (Smart Token Pruning), a novel framework that balances both objectives. We formulate token pruning as a bi-criteria optimization problem that jointly maximizes semantic diversity, to preserve broad coverage of visual concepts, and token importance, quantified via a new gradient-based saliency score that integrates feature sensitivity and activation strength. STP introduces a unified token selection strategy that adaptively prunes tokens based on their joint diversity-importance score, ensuring both efficient computation and reliable visual-textual reasoning. Extensive experiments across 11 diverse benchmarks show that STP achieves significant reductions in computation and memory usage while maintaining competitive accuracy. This enables scalable and resource-efficient deployment of LVLMs.
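The abstract describes a joint diversity-importance selection: a gradient-based saliency score combining feature sensitivity and activation strength, traded off against semantic diversity. The paper's exact formulation is not given here, so the sketch below is only an illustrative assumption: saliency is taken as gradient norm times activation norm, diversity as cosine distance to the nearest already-selected token, and the two are mixed with a hypothetical weight `alpha` in a greedy selection loop.

```python
import numpy as np

def saliency_scores(features, grads):
    """Hypothetical gradient-based saliency: feature sensitivity
    (gradient magnitude) times activation strength (feature norm).
    This is an assumed instantiation, not the paper's exact score."""
    sensitivity = np.linalg.norm(grads, axis=1)
    activation = np.linalg.norm(features, axis=1)
    return sensitivity * activation

def prune_tokens(features, grads, keep, alpha=0.5):
    """Greedy selection balancing importance and semantic diversity.
    alpha (assumed hyperparameter) weights importance vs. diversity."""
    imp = saliency_scores(features, grads)
    imp = imp / (imp.max() + 1e-8)
    # Unit-normalize features so that 1 - dot product is cosine distance.
    normed = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    selected = [int(np.argmax(imp))]  # seed with the most important token
    # Diversity term: cosine distance to the nearest selected token so far.
    min_dist = 1.0 - normed @ normed[selected[0]]
    for _ in range(keep - 1):
        joint = alpha * imp + (1 - alpha) * min_dist
        joint[selected] = -np.inf  # never reselect a kept token
        nxt = int(np.argmax(joint))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - normed @ normed[nxt])
    return sorted(selected)

# Usage: prune 200 visual tokens (dim 64) down to 20.
rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 64))
grads = rng.standard_normal((200, 64))
kept = prune_tokens(feats, grads, keep=20)
```

In this sketch, a larger `alpha` favors high-saliency tokens, while a smaller one spreads the kept set across distinct visual concepts; the real STP criterion may differ.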
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7149