Spot the Critical Words: Text-Guided Visual Token Pruning for Efficient Large Vision-Language Model Inference
Keywords: LVLMs, Token Pruning, Efficient Inference
Abstract: The computational efficiency of Large Vision-Language Models (LVLMs) is severely hampered by the overhead of processing massive numbers of visual tokens. While token pruning has emerged as a promising solution, prevailing methods that rely on text-visual cross-attention suffer from attention shift, a phenomenon in which attention maps fail to accurately localize instruction-relevant regions, thereby retaining substantial visual redundancy. To address this issue, we propose TextScythe, an intuitive yet potent pruning framework that first identifies vision-critical text tokens through an entropy-based analysis of cross-modal cosine similarity, effectively distilling the user's instructions. It then selects visual tokens exhibiting outlier-level similarity to these critical text tokens. To preserve contextual completeness, a diversity-aware mechanism supplements background tokens based on their intrinsic attention scores. Extensive experiments show that TextScythe achieves state-of-the-art performance across various benchmarks, enabling an extreme 88.9% token reduction in LLaVA while retaining 96.6% of the original accuracy, thereby establishing an efficient and effective deployment paradigm for LVLMs. The code will be released.
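The abstract describes a three-step pruning pipeline (entropy-based selection of vision-critical text tokens, outlier-level similarity matching for visual tokens, and a diversity-aware background supplement). Below is a minimal, hypothetical sketch of how such a pipeline could look; all function and parameter names (`textscythe_prune`, `critical_ratio`, `outlier_k`) are illustrative assumptions and not the authors' released implementation.

```python
# Hypothetical sketch of the pruning pipeline described in the abstract.
# Names and thresholds are assumptions, not the paper's actual code.
import torch
import torch.nn.functional as F

def textscythe_prune(text_emb, vis_emb, vis_attn, keep_budget,
                     critical_ratio=0.2, outlier_k=1.5):
    """Select visual tokens guided by vision-critical text tokens.

    text_emb: (n_t, d) text token embeddings in a shared space
    vis_emb:  (n_v, d) visual token embeddings in the same space
    vis_attn: (n_v,)   intrinsic attention scores of the visual tokens
    keep_budget: total number of visual tokens to retain
    """
    # 1. Cross-modal cosine similarity between every text and visual token.
    sim = F.normalize(text_emb, dim=-1) @ F.normalize(vis_emb, dim=-1).T  # (n_t, n_v)

    # 2. Entropy of each text token's similarity distribution over visual
    #    tokens: low entropy -> sharply focused -> treated as vision-critical.
    probs = sim.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)          # (n_t,)
    n_critical = max(1, int(critical_ratio * text_emb.size(0)))
    critical_idx = entropy.topk(n_critical, largest=False).indices

    # 3. Score visual tokens by their best match to any critical text token,
    #    keeping those with outlier-level similarity (mean + k * std).
    vis_score = sim[critical_idx].max(dim=0).values                       # (n_v,)
    threshold = vis_score.mean() + outlier_k * vis_score.std()
    selected = (vis_score > threshold).nonzero(as_tuple=True)[0]

    # 4. Diversity-aware supplement: fill the remaining budget with
    #    background tokens ranked by their intrinsic attention scores.
    remaining = keep_budget - selected.numel()
    if remaining > 0:
        mask = torch.ones_like(vis_attn, dtype=torch.bool)
        mask[selected] = False
        background = vis_attn.masked_fill(~mask, float('-inf'))
        supplement = background.topk(min(remaining, int(mask.sum()))).indices
        selected = torch.cat([selected, supplement])

    return selected[:keep_budget]
```

Under these assumptions, pruning to roughly 88.9% reduction would amount to setting `keep_budget` to about one ninth of the original visual token count before passing the retained tokens to the language model.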
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1292