Spot the Critical Words: Text-Guided Visual Token Pruning for Efficient Large Vision-Language Model Inference
Keywords: LVLMs, Token Pruning, Efficient Inference
Abstract: The computational efficiency of Large Vision-Language Models (LVLMs) is severely hampered by the overhead of processing massive numbers of visual tokens. While token pruning has emerged as a promising solution, prevailing methods that rely on text-visual cross-attention suffer from attention shift, a phenomenon in which attention maps fail to accurately localize instruction-relevant regions, thereby retaining substantial visual redundancy. To address this issue, we propose TextScythe, an intuitive yet potent pruning framework that first identifies vision-critical text tokens through an entropy-based analysis of cross-modal cosine similarity, effectively distilling the user's instructions. It then selects visual tokens exhibiting outlier-level similarity to these critical text tokens. To preserve contextual completeness, a diversity-aware mechanism supplements background tokens based on their intrinsic attention scores. Extensive experiments show that TextScythe achieves state-of-the-art performance across various benchmarks, enabling an extreme 88.9% token reduction in LLaVA while retaining 96.6% of the original accuracy, thereby establishing an efficient and effective deployment paradigm for LVLMs. The code will be released.
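The abstract describes a three-step pruning pipeline (entropy-based selection of vision-critical text tokens, outlier-level similarity matching for visual tokens, and a diversity-aware background supplement). Below is a minimal, hypothetical sketch of how such a pipeline could look; all function and parameter names (`textscythe_prune`, `critical_ratio`, `outlier_k`) are illustrative assumptions and not the authors' released implementation.

```python
# Hypothetical sketch of the pruning pipeline described in the abstract.
# Names and thresholds are assumptions, not the paper's actual code.
import torch
import torch.nn.functional as F

def textscythe_prune(text_emb, vis_emb, vis_attn, keep_budget,
                     critical_ratio=0.2, outlier_k=1.5):
    """Select visual tokens guided by vision-critical text tokens.

    text_emb: (n_t, d) text token embeddings in a shared space
    vis_emb:  (n_v, d) visual token embeddings in the same space
    vis_attn: (n_v,)   intrinsic attention scores of the visual tokens
    keep_budget: total number of visual tokens to retain
    """
    # 1. Cross-modal cosine similarity between every text and visual token.
    sim = F.normalize(text_emb, dim=-1) @ F.normalize(vis_emb, dim=-1).T  # (n_t, n_v)

    # 2. Entropy of each text token's similarity distribution over visual
    #    tokens: low entropy -> sharply focused -> treated as vision-critical.
    probs = sim.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)          # (n_t,)
    n_critical = max(1, int(critical_ratio * text_emb.size(0)))
    critical_idx = entropy.topk(n_critical, largest=False).indices

    # 3. Score visual tokens by their best match to any critical text token,
    #    keeping those with outlier-level similarity (mean + k * std).
    vis_score = sim[critical_idx].max(dim=0).values                       # (n_v,)
    threshold = vis_score.mean() + outlier_k * vis_score.std()
    selected = (vis_score > threshold).nonzero(as_tuple=True)[0]

    # 4. Diversity-aware supplement: fill the remaining budget with
    #    background tokens ranked by their intrinsic attention scores.
    remaining = keep_budget - selected.numel()
    if remaining > 0:
        mask = torch.ones_like(vis_attn, dtype=torch.bool)
        mask[selected] = False
        background = vis_attn.masked_fill(~mask, float('-inf'))
        supplement = background.topk(min(remaining, int(mask.sum()))).indices
        selected = torch.cat([selected, supplement])

    return selected[:keep_budget]
```

Under these assumptions, pruning to roughly 88.9% reduction would amount to setting `keep_budget` to about one ninth of the original visual token count before passing the retained tokens to the language model.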
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1292