Polymorphic: Plug-and-Play Visual Token Compression for Scalable VLMs

ICLR 2026 Conference Submission 17299 Authors

Published: 19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: pruning, vision language model
Abstract: Recent advances in vision language models (VLMs) have enabled strong reasoning and generalization capabilities, but these models remain computationally expensive, primarily due to the quadratic complexity of Transformer self-attention and the large number of visual tokens produced by high-resolution inputs. To address this limitation, we propose a flexible, plug-and-play framework for visual token pruning that integrates seamlessly into existing VLMs without additional training or architectural modification. Our approach employs a two-stage strategy. In the first stage, representation-level token merging based on spatial information density removes redundant visual features. In the second stage, tokens with low cross-modal relevance are adaptively pruned during language model prefilling, focusing computation on the most informative regions. This design substantially reduces the visual token budget, improving both inference speed and memory efficiency while maintaining strong task performance. Extensive experiments on widely used benchmarks show that our method consistently achieves a favorable trade-off between efficiency and accuracy, highlighting its potential for practical deployment of high-resolution VLMs in real-world applications.
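The abstract describes a two-stage compression pipeline but does not give implementation details. Below is a minimal, self-contained sketch of such a pipeline, assuming a similarity-based merge for the spatial-density stage and text-conditioned relevance scoring for the cross-modal stage; the function names (`spatial_density_merge`, `cross_modal_prune`), the scoring rules, and the keep ratios are illustrative placeholders, not the submission's actual method.

```python
# Minimal sketch of two-stage visual token compression, assuming:
#   (1) spatially redundant tokens are merged by feature similarity, and
#   (2) tokens with low relevance to the text prompt are pruned before prefill.
# All names and thresholds here are hypothetical stand-ins for illustration.
import torch
import torch.nn.functional as F


def spatial_density_merge(vis_tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Stage 1: fold near-duplicate visual tokens into their most similar survivors.

    vis_tokens: (N, D) features from the vision encoder.
    """
    x = F.normalize(vis_tokens, dim=-1)
    sim = x @ x.T                                   # (N, N) cosine similarity
    sim.fill_diagonal_(-1.0)
    density = 1.0 - sim.max(dim=-1).values          # low if a near-duplicate exists
    n_keep = max(1, int(keep_ratio * vis_tokens.shape[0]))
    order = density.argsort(descending=True)
    keep_idx = order[:n_keep].sort().values         # preserve spatial order of survivors
    drop_idx = order[n_keep:]
    merged = vis_tokens[keep_idx].clone()
    counts = torch.ones(n_keep, 1, dtype=vis_tokens.dtype)
    # Average each dropped token into its most similar kept token (running mean).
    nearest = sim[drop_idx][:, keep_idx].argmax(dim=-1)
    for d, k in zip(drop_idx.tolist(), nearest.tolist()):
        merged[k] = (merged[k] * counts[k] + vis_tokens[d]) / (counts[k] + 1)
        counts[k] += 1
    return merged


def cross_modal_prune(vis_tokens: torch.Tensor, txt_tokens: torch.Tensor,
                      keep_ratio: float = 0.5) -> torch.Tensor:
    """Stage 2: keep only visual tokens relevant to the text query.

    Relevance is approximated here by similarity to the mean prompt embedding;
    a real VLM would read it off the prefill attention maps instead.
    """
    query = F.normalize(txt_tokens.mean(dim=0, keepdim=True), dim=-1)     # (1, D)
    relevance = (F.normalize(vis_tokens, dim=-1) @ query.T).squeeze(-1)   # (N,)
    n_keep = max(1, int(keep_ratio * vis_tokens.shape[0]))
    keep_idx = relevance.topk(n_keep).indices.sort().values
    return vis_tokens[keep_idx]


if __name__ == "__main__":
    vis = torch.randn(576, 1024)   # e.g. 24x24 patch tokens from the vision tower
    txt = torch.randn(32, 1024)    # prompt embeddings projected to the same width
    stage1 = spatial_density_merge(vis, keep_ratio=0.5)       # 576 -> 288 tokens
    stage2 = cross_modal_prune(stage1, txt, keep_ratio=0.5)   # 288 -> 144 tokens
    print(vis.shape, stage1.shape, stage2.shape)
```

In this sketch the two stages compose multiplicatively (a 0.5 ratio at each stage keeps 25% of the original tokens), which mirrors the training-free, plug-and-play placement described in the abstract: stage one runs on the vision-encoder output, stage two just before (or during) language model prefilling.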
Primary Area: generative models
Submission Number: 17299