SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY-NC-SA 4.0
TL;DR: We propose an efficient, text-aware, training-free visual token optimization mechanism called SparseVLM.
Abstract: In vision-language models (VLMs), visual tokens usually incur significant computational overhead despite carrying lower information density than text tokens. To address this, most existing methods learn a network to prune redundant visual tokens, which requires additional training data. In contrast, we propose an efficient training-free token optimization mechanism dubbed **SparseVLM** that introduces no extra parameters or fine-tuning costs. Concretely, given that visual tokens complement text tokens in VLMs for linguistic reasoning, we select visually relevant text tokens to rate the significance of vision tokens using the self-attention matrix extracted from the VLM, and then progressively prune irrelevant tokens. To maximize sparsity while retaining essential information, we introduce a rank-based strategy to adaptively determine the sparsification ratio for each layer, alongside a token recycling method that compresses pruned tokens into more compact representations. Experimental results show that SparseVLM improves the efficiency of various VLMs across a range of image and video understanding tasks. In particular, when LLaVA is equipped with SparseVLM, it achieves a 54\% reduction in FLOPs, lowers CUDA time by 37\%, and retains 97\% of the original accuracy. Our code is available at https://github.com/Gumpest/SparseVLMs.
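To make the mechanism described above concrete, the following is a minimal sketch of text-guided visual token scoring, pruning, and recycling. All names (`score_visual_tokens`, `prune_and_recycle`, `keep_ratio`, `n_recycled`) are hypothetical illustrations rather than the authors' API; the per-layer rank-based ratio selection is omitted and a fixed ratio is used instead. See the linked repository for the actual implementation.

```python
# Illustrative sketch only: assumes a decoder layer's self-attention matrix is
# available, and approximates SparseVLM's idea with simplified, hypothetical code.
import torch


def score_visual_tokens(attn: torch.Tensor,
                        text_idx: torch.Tensor,
                        visual_idx: torch.Tensor) -> torch.Tensor:
    """Rate each visual token by the attention it receives from the selected
    (visually relevant) text tokens in one layer.

    attn: (heads, seq, seq) self-attention matrix.
    text_idx / visual_idx: sequence positions of text and visual tokens.
    """
    # Attention from text queries to visual keys, averaged over heads and queries.
    text_to_visual = attn[:, text_idx][:, :, visual_idx]  # (heads, n_text, n_vis)
    return text_to_visual.mean(dim=(0, 1))                # (n_vis,)


def prune_and_recycle(visual_tokens: torch.Tensor,
                      scores: torch.Tensor,
                      keep_ratio: float = 0.5,
                      n_recycled: int = 4) -> torch.Tensor:
    """Keep the highest-scoring visual tokens and compress the pruned ones
    into a few aggregated tokens instead of discarding them outright."""
    n_vis = visual_tokens.shape[0]
    n_keep = max(1, int(n_vis * keep_ratio))
    order = scores.argsort(descending=True)
    kept = visual_tokens[order[:n_keep]]

    pruned = visual_tokens[order[n_keep:]]
    if pruned.shape[0] == 0:
        return kept
    # Naive "recycling": split pruned tokens into groups and average each group.
    groups = pruned.tensor_split(min(n_recycled, pruned.shape[0]), dim=0)
    recycled = torch.stack([g.mean(dim=0) for g in groups])
    return torch.cat([kept, recycled], dim=0)


if __name__ == "__main__":
    heads, n_text, n_vis, dim = 8, 16, 576, 1024
    seq = n_text + n_vis
    attn = torch.rand(heads, seq, seq).softmax(dim=-1)
    visual_tokens = torch.randn(n_vis, dim)
    visual_idx = torch.arange(0, n_vis)
    text_idx = torch.arange(n_vis, seq)  # text tokens follow visual tokens here

    scores = score_visual_tokens(attn, text_idx, visual_idx)
    compact = prune_and_recycle(visual_tokens, scores, keep_ratio=0.4)
    print(compact.shape)  # far fewer rows than the original 576 visual tokens
```

In the actual method the kept/pruned split is decided adaptively per layer via a rank-based criterion rather than the fixed `keep_ratio` shown here.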
Lay Summary: Vision-language models (VLMs) combine images and text to perform tasks like answering questions about pictures or videos. However, processing visual information can be slow and inefficient because these models analyze every part of an image, even when many parts aren't relevant. Our method, SparseVLM, makes VLMs faster and more efficient by automatically identifying and removing unnecessary visual details while keeping the important ones. Unlike other approaches that require extra training, SparseVLM works without any modification to the original model. It uses the model's attention patterns (how much the text "focuses" on different parts of the image) to decide which visual details can be safely removed. Additionally, it recycles the removed details into a simpler form to save even more computation. Experiments show that SparseVLM can cut computation costs by over 50% and speed up processing by 37% while maintaining nearly the same accuracy. This makes VLMs more practical for real-world applications without sacrificing performance.
Link To Code: https://github.com/Gumpest/SparseVLMs
Primary Area: Deep Learning->Algorithms
Keywords: Sparsification, Vision Language Models, Efficiency
Submission Number: 4137