An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models
Keywords: Multimodal Large Language Models, Visual Token Pruning
Abstract: Large vision-language models (LVLMs) have adopted visual token pruning strategies to mitigate the substantial computational overhead incurred by long visual token sequences. While prior work focuses primarily on either attention-based or diversity-based pruning, an in-depth analysis of the characteristics and limitations of these two approaches has been largely lacking. In this work, we conduct a thorough empirical analysis, using effective rank (erank) as a measure of feature diversity together with attention-score entropy, to investigate how visual tokens are processed and to characterize the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) attention-based methods perform better on simple images, where information is easily concentrated, whereas diversity-based methods excel on complex images with distributed features; (2) analysis on the CHAIR hallucination benchmark shows that attention-based methods generate more conservative answers with lower hallucination rates, whereas diversity-based methods produce more exploratory responses with a higher tendency to hallucinate. Motivated by these observations, we propose a novel token pruning framework that adaptively combines the strengths of both methods. Extensive experiments show that our method delivers consistently high performance across both standard benchmarks and hallucination evaluation datasets.
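To make the two quantities in the abstract concrete, below is a minimal PyTorch sketch of effective rank (exponential of the Shannon entropy of normalized singular values) and attention-score entropy, plus a hypothetical entropy-gated switch between attention-based top-k selection and a diversity-based (farthest-point) selection. The threshold value, the farthest-point heuristic, and all function names are illustrative assumptions, not the paper's implementation.

```python
import torch

def effective_rank(feats: torch.Tensor) -> float:
    """erank of a [num_tokens, dim] feature matrix: exp of the
    Shannon entropy of its normalized singular values."""
    s = torch.linalg.svdvals(feats.float())
    p = s / s.sum()
    p = p[p > 0]                          # avoid log(0)
    return torch.exp(-(p * p.log()).sum()).item()

def attention_entropy(attn: torch.Tensor) -> float:
    """Shannon entropy of a 1-D attention distribution over visual tokens."""
    p = attn / attn.sum()
    p = p[p > 0]
    return -(p * p.log()).sum().item()

def adaptive_prune(feats: torch.Tensor, attn: torch.Tensor,
                   keep: int, entropy_threshold: float = 4.0) -> torch.Tensor:
    """Keep `keep` visual tokens: attention top-k when attention is
    concentrated (low entropy), otherwise greedy farthest-point
    sampling in feature space as a simple diversity criterion."""
    if attention_entropy(attn) < entropy_threshold:
        return attn.topk(keep).indices    # attention-based pruning
    sel = [int(attn.argmax())]            # seed with the most-attended token
    dmin = torch.cdist(feats, feats[sel]).squeeze(1)
    for _ in range(keep - 1):
        nxt = int(dmin.argmax())          # token farthest from the kept set
        sel.append(nxt)
        dmin = torch.minimum(
            dmin, torch.cdist(feats, feats[nxt:nxt + 1]).squeeze(1))
    return torch.tensor(sel)
```

For instance, with 576 CLIP patch tokens of dimension 1024, `adaptive_prune(feats, attn, keep=144)` would return 144 token indices, routing each image through whichever criterion the entropy gate selects; any learned or calibrated gating the paper actually uses would replace the fixed threshold here.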
Our project page is available at https://anonymous.4open.science/w/AdaVTP-186A/
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16166