An Empirical Study of Attention and Diversity for Adaptive Visual Token Pruning in Multimodal Reasoning Models
Keywords: Multimodal Reasoning, Visual Token Pruning, Adaptive Framework
Abstract: With the rapid progress of Large Reasoning Models (LRMs), interest in multimodal reasoning has grown substantially. However, multimodal reasoning often requires processing a large number of visual tokens, leading to significant computational overhead. To alleviate this issue, recent studies have explored visual token pruning strategies. Most prior works focus on either attention-based or diversity-based pruning methods; however, an in-depth analysis of their characteristics and limitations remains largely unexplored. In this work, we conduct a thorough empirical analysis using effective rank (erank) as a measure of feature diversity, together with attention score entropy, to investigate visual token processing mechanisms and to examine the strengths and weaknesses of each approach. Our analysis reveals two insights: (1) Attention-based methods perform better on simple images where information is easily concentrated, whereas diversity-based methods excel on complex images with distributed features. (2) Analysis on the CHAIR hallucination benchmark shows that attention-based methods generate more conservative answers with lower hallucination rates, whereas diversity-based methods produce more exploratory responses with a higher tendency to hallucinate. Motivated by these observations, we propose a novel token pruning framework that adaptively combines the strengths of both methods.
Extensive experiments show that our method delivers consistently high performance with efficient reasoning across both standard benchmarks and hallucination evaluation datasets.
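The two diagnostic quantities named in the abstract can be sketched concretely. Below is a minimal illustration (not the authors' implementation) of effective rank, computed as the exponential of the Shannon entropy of the normalized singular values of a token feature matrix, and of attention score entropy over a distribution of attention weights assigned to visual tokens; the function names and the small epsilon for numerical stability are our own choices.

```python
import numpy as np

def effective_rank(features: np.ndarray) -> float:
    """Effective rank (erank) of a (num_tokens, dim) feature matrix:
    exp of the entropy of the normalized singular value distribution."""
    s = np.linalg.svd(features, compute_uv=False)
    p = s / s.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))  # epsilon guards log(0)
    return float(np.exp(entropy))

def attention_entropy(attn: np.ndarray) -> float:
    """Shannon entropy of a vector of attention scores over visual tokens."""
    p = attn / attn.sum()  # normalize to a probability distribution
    return float(-np.sum(p * np.log(p + 1e-12)))
```

Higher erank indicates that feature energy is spread across many directions (a "complex" image in the abstract's terms), while low attention entropy indicates that attention mass is concentrated on a few tokens, the regime where attention-based pruning is reported to do well.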
Submission Number: 200