Keywords: Token Pruning
TL;DR: STP, a token-pruning framework that jointly optimizes token importance and semantic diversity for efficient inference in large vision-language models
Abstract: Large Vision-Language Models (LVLMs) have achieved remarkable success in multimodal reasoning by jointly processing visual and textual information. However, efficient inference in practical applications remains challenging due to the substantial computational and memory overhead of LVLMs. Existing token pruning strategies often face a trade-off: they either prioritize token importance while neglecting semantic diversity, or enforce diversity at the expense of critical tokens. To overcome this limitation, we propose STP (Smart Token Pruning), a novel framework that balances both objectives. We formulate token pruning as a bi-criteria optimization problem that jointly maximizes semantic diversity, to preserve broad coverage of visual concepts, and token importance, quantified via a new gradient-based saliency score that integrates feature sensitivity and activation strength. STP introduces a unified token selection strategy that adaptively prunes tokens based on their joint diversity-importance score, ensuring both efficient computation and reliable visual-textual reasoning. Extensive experiments across 11 diverse benchmarks show that STP achieves significant reductions in computation and memory usage while maintaining competitive accuracy. This enables scalable and resource-efficient deployment of LVLMs.
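The abstract describes a joint diversity-importance selection: a gradient-based saliency score combining feature sensitivity and activation strength, traded off against semantic diversity. The paper's exact formulation is not given here, so the sketch below is only an illustrative assumption: saliency is taken as gradient norm times activation norm, diversity as cosine distance to the nearest already-selected token, and the two are mixed with a hypothetical weight `alpha` in a greedy selection loop.

```python
import numpy as np

def saliency_scores(features, grads):
    """Hypothetical gradient-based saliency: feature sensitivity
    (gradient magnitude) times activation strength (feature norm).
    This is an assumed instantiation, not the paper's exact score."""
    sensitivity = np.linalg.norm(grads, axis=1)
    activation = np.linalg.norm(features, axis=1)
    return sensitivity * activation

def prune_tokens(features, grads, keep, alpha=0.5):
    """Greedy selection balancing importance and semantic diversity.
    alpha (assumed hyperparameter) weights importance vs. diversity."""
    imp = saliency_scores(features, grads)
    imp = imp / (imp.max() + 1e-8)
    # Unit-normalize features so that 1 - dot product is cosine distance.
    normed = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    selected = [int(np.argmax(imp))]  # seed with the most important token
    # Diversity term: cosine distance to the nearest selected token so far.
    min_dist = 1.0 - normed @ normed[selected[0]]
    for _ in range(keep - 1):
        joint = alpha * imp + (1 - alpha) * min_dist
        joint[selected] = -np.inf  # never reselect a kept token
        nxt = int(np.argmax(joint))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, 1.0 - normed @ normed[nxt])
    return sorted(selected)

# Usage: prune 200 visual tokens (dim 64) down to 20.
rng = np.random.default_rng(0)
feats = rng.standard_normal((200, 64))
grads = rng.standard_normal((200, 64))
kept = prune_tokens(feats, grads, keep=20)
```

In this sketch, a larger `alpha` favors high-saliency tokens, while a smaller one spreads the kept set across distinct visual concepts; the real STP criterion may differ.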
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 7149