An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Large Vision-Language Models

ICLR 2026 Conference Submission 16166 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Multimodal Large Language Models, Visual Token Pruning
Abstract: Large vision-language models (LVLMs) have adopted visual token pruning strategies to mitigate the substantial computational overhead incurred by long visual token sequences. While prior work focuses primarily on either attention-based or diversity-based pruning, an in-depth analysis of the characteristics and limitations of these approaches is still largely missing. In this work, we conduct a thorough empirical analysis, using effective rank (erank) as a measure of feature diversity and attention score entropy, to investigate how visual tokens are processed and to analyze the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) attention-based methods perform better on simple images, where information is easily concentrated, whereas diversity-based methods excel on complex images with distributed features; (2) analysis on the CHAIR hallucination benchmark shows that attention-based methods generate more conservative answers with lower hallucination rates, whereas diversity-based methods produce more exploratory responses with a higher tendency to hallucinate. Motivated by these observations, we propose a novel token pruning framework that adaptively combines the strengths of both methods. Extensive experiments show that our method delivers consistently high performance across both standard benchmarks and hallucination evaluation datasets. Our project page is available at https://anonymous.4open.science/w/AdaVTP-186A/
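For readers unfamiliar with the two diagnostics named in the abstract, the sketch below shows how effective rank and attention entropy are commonly computed. It follows the standard definition of effective rank (exponential of the Shannon entropy of normalized singular values) rather than the authors' exact implementation; the tensor shapes, function names, and the example inputs are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): effective rank of a visual-token
# feature matrix and entropy of an attention distribution over visual tokens.
import torch

def effective_rank(features: torch.Tensor, eps: float = 1e-12) -> float:
    """erank = exp(H(p)), where p are the normalized singular values of
    `features` (num_tokens x hidden_dim). Higher erank = more diverse tokens."""
    s = torch.linalg.svdvals(features.float())
    p = s / (s.sum() + eps)
    return torch.exp(-(p * torch.log(p + eps)).sum()).item()

def attention_entropy(attn: torch.Tensor, eps: float = 1e-12) -> float:
    """Shannon entropy of an attention distribution (num_visual_tokens,).
    Low entropy = attention concentrated on a few tokens (a 'simple' image)."""
    p = attn / (attn.sum() + eps)
    return -(p * torch.log(p + eps)).sum().item()

# Illustrative usage: 576 visual tokens with 1024-dim features (hypothetical sizes).
feats = torch.randn(576, 1024)
attn = torch.softmax(torch.randn(576), dim=0)
print(effective_rank(feats), attention_entropy(attn))
```

Under this reading, an adaptive pruner could weight attention-based and diversity-based token selection according to these two statistics, which is consistent with the high-level framework the abstract describes.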
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16166