Reasoning with Fewer Eyes: Efficient Visual Token Withdrawal for Multimodal Reasoning

Published: 16 Oct 2025, Last Modified: 10 Nov 2025 · NeurIPS 2025 ER Workshop · CC BY 4.0
Keywords: Vision-Language Models, Multimodal Reasoning, Inference Efficiency, Vision Token Withdrawal
TL;DR: We speed up reasoning VLMs by dropping vision tokens after M generation steps. No training required. Compatible with popular efficiency techniques.
Abstract: Vision-language models (VLMs) have shown strong promise for multimodal reasoning tasks, where autoregressive generation allows the model to combine perception and abstract reasoning. However, especially when processing high-resolution images or long videos, the large number of visual tokens severely slows down inference. Drawing on the observation that attention devoted to vision tokens consistently drops during autoregressive text generation, we propose a simple method to accelerate multimodal reasoning: after the model has generated a small number of text tokens, we remove all vision tokens from subsequent decoding steps. This reduces both memory usage and computation, while retaining the model’s ability to ground its reasoning in the visual input. Our approach requires no additional training and is fully compatible with popular efficiency techniques such as KV caching and FlashAttention. Experiments on multiple datasets and with different models demonstrate that our method achieves substantial speedups with minimal impact on reasoning accuracy.
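
For illustration only, the sketch below shows how the described withdrawal step could look inside a greedy decoding loop. It assumes a Hugging Face-style VLM forward that accepts `input_ids`, `pixel_values`, `past_key_values`, `position_ids`, and `use_cache`, and the legacy tuple-of-(key, value) cache layout; `vision_start`, `vision_len`, and `withdraw_after` (the number of text tokens generated before withdrawal, M in the TL;DR) are hypothetical placeholders, and models using `Cache` objects or multi-dimensional rotary positions would need corresponding adjustments. This is not the authors' released implementation.

```python
import torch


def withdraw_vision_tokens(past_key_values, vision_start, vision_len):
    """Slice the cached vision-token keys/values out of every layer.

    Assumes the legacy tuple cache layout with tensors of shape
    [batch, num_heads, seq_len, head_dim] (an assumption, not the paper's code).
    """
    end = vision_start + vision_len
    return tuple(
        (torch.cat([k[:, :, :vision_start], k[:, :, end:]], dim=2),
         torch.cat([v[:, :, :vision_start], v[:, :, end:]], dim=2))
        for k, v in past_key_values
    )


@torch.no_grad()
def generate_with_withdrawal(model, input_ids, pixel_values,
                             vision_start, vision_len,
                             withdraw_after=8, max_new_tokens=256):
    # Prefill: encode image + prompt once and keep the full KV cache.
    out = model(input_ids=input_ids, pixel_values=pixel_values, use_cache=True)
    past = out.past_key_values
    next_tok = out.logits[:, -1].argmax(dim=-1, keepdim=True)
    # Absolute position of the next token, recorded before any trimming so
    # rotary positions stay consistent after the cache shrinks.
    pos = past[0][0].shape[2]
    generated = [next_tok]

    for step in range(1, max_new_tokens):
        if step == withdraw_after:
            # Withdrawal: all vision tokens leave the cache; later decoding
            # steps attend only to the remaining text context.
            past = withdraw_vision_tokens(past, vision_start, vision_len)
        position_ids = torch.full_like(next_tok, pos)
        out = model(input_ids=next_tok, past_key_values=past,
                    position_ids=position_ids, use_cache=True)
        past = out.past_key_values
        next_tok = out.logits[:, -1].argmax(dim=-1, keepdim=True)
        generated.append(next_tok)
        pos += 1
        # (EOS handling omitted for brevity.)

    return torch.cat(generated, dim=1)
```

Because the trimmed cache never reappears, every decoding step after withdrawal attends over a shorter sequence, which is where the memory and compute savings claimed in the abstract come from.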
Submission Number: 108