Keywords: efficient vision-language models, variational inference, token pruning
Abstract: Vision-language models (VLMs) with dynamic resolution vision encoders achieve strong performance, but face significant efficiency challenges due to long input sequences. A common approach is to assess the importance of visual tokens and prune those that are less informative. Recent methods that use a small VLM to produce an importance map over visual tokens have outperformed existing rule-based and similarity-driven pruning approaches, particularly under high pruning ratios. However, relying directly on the small VLM remains unreliable: it forms the importance map by aggregating cross-attention weights between all of its generated answer tokens and the visual inputs, so an incorrect generated answer leads to noisy guidance.
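For intuition, a minimal sketch of the importance-map pruning that such small-VLM-guided methods perform, assuming a simple mean aggregation of cross-attention and top-k selection (the function name `prune_by_importance` and all shapes are illustrative, not the paper's exact procedure):

```python
import torch

def prune_by_importance(visual_tokens, cross_attn, keep_ratio=0.05):
    """Illustrative baseline: aggregate cross-attention from generated answer
    tokens to visual tokens into an importance map, then keep the top-k tokens.

    visual_tokens: (N, D) visual token embeddings
    cross_attn:    (T, N) attention weights from T answer tokens to N visual tokens
    """
    importance = cross_attn.mean(dim=0)                  # (N,) importance map
    k = max(1, int(keep_ratio * visual_tokens.size(0)))  # e.g. retain 5% of tokens
    keep_idx = importance.topk(k).indices
    return visual_tokens[keep_idx], keep_idx

# Toy usage: 1024 visual tokens, 32 generated answer tokens
tokens = torch.randn(1024, 768)
attn = torch.rand(32, 1024).softmax(dim=-1)
kept, idx = prune_by_importance(tokens, attn, keep_ratio=0.05)
```

If the small VLM's answer is wrong, the rows of `cross_attn` attend to misleading regions, which is the failure mode the abstract describes.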
To address this, we invert the approach: instead of asking the small VLM for an answer, we have it detect non-informative visual tokens conditioned on the user's input query. By adding a learnable information bottleneck to the small VLM, we approximate the posterior distribution over non-informative visual tokens. This enables the small model to highlight broad informative regions, allowing the large VLM to retain its reasoning capacity with improved efficiency.
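One hypothetical way such a query-conditioned bottleneck over visual tokens could be parameterized is sketched below; the class name `TokenBottleneck`, the Gumbel-sigmoid relaxation, and the Bernoulli prior are assumptions for illustration, not the paper's exact formulation:

```python
import torch
import torch.nn as nn

class TokenBottleneck(nn.Module):
    """Sketch of a learnable information bottleneck over visual tokens:
    score each token against the query, sample a relaxed keep mask, and
    regularize the mask toward a sparse Bernoulli prior."""

    def __init__(self, dim, prior_keep=0.05, temperature=0.5):
        super().__init__()
        self.scorer = nn.Linear(2 * dim, 1)   # scores token-query pairs
        self.prior_keep = prior_keep
        self.temperature = temperature

    def forward(self, visual_tokens, query_embed):
        # visual_tokens: (N, D), query_embed: (D,)
        q = query_embed.expand(visual_tokens.size(0), -1)
        logits = self.scorer(torch.cat([visual_tokens, q], dim=-1)).squeeze(-1)  # (N,)
        # Relaxed Bernoulli (Gumbel-sigmoid) sample of the per-token keep mask
        u = torch.rand_like(logits)
        noise = torch.log(u) - torch.log1p(-u)
        mask = torch.sigmoid((logits + noise) / self.temperature)
        # KL(q(mask | tokens, query) || Bernoulli(prior_keep)) as the bottleneck penalty
        p = torch.sigmoid(logits)
        prior = torch.full_like(p, self.prior_keep)
        kl = (p * torch.log(p / prior + 1e-8)
              + (1 - p) * torch.log((1 - p) / (1 - prior) + 1e-8)).mean()
        return mask, kl
```

Tokens with low keep probability under the learned posterior would be pruned before being passed to the large VLM, with the KL term encouraging the mask to stay close to the target sparsity.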
Extensive experiments on eight benchmarks demonstrate the effectiveness of our approach. With only 5\% of visual tokens retained, the large VLM preserves 95\% of its original performance, outperforming the state of the art by 8\%.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6351