SparseVILA-R1: Decoupling Visual Sparsity for Efficient VLM Reasoning

Published: 16 Oct 2025, Last Modified: 10 Nov 2025, NeurIPS 2025 ER Workshop Spotlight, CC BY 4.0
Keywords: VLM, Sparsity
TL;DR: Token Sparsity for Efficient Reasoning with Vision Language Models.
Abstract: Enabling Vision Language Models (VLMs) to $\textit{reason}$ requires operating over long chains of multimodal evidence grounded in video and physical interaction. The compute profile of such reasoning VLMs differs starkly from standard visual question answering (VQA)-style inference: reasoning VLMs typically generate large numbers of decoding tokens, shifting the latency distribution to the decoding stage and making token throughput the bottleneck of inference cost. We present SparseVILA-R1, an inference-time token-sparsity approach tailored to visual reasoning. By $\textit{decoupling}$ prefill and decoding sparsities, SparseVILA-R1 strategically targets token reduction, achieving up to a 1.9$\times$ speedup with limited performance degradation. By aligning sparsity with the compute profile of visual reasoning models, SparseVILA-R1 preserves cross-modal grounding while improving end-to-end efficiency, operating at the speed-accuracy Pareto frontier for long-context visual reasoning.
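The sketch below illustrates the general idea of decoupling prefill and decoding sparsities for visual tokens; it is not the authors' implementation. The importance scores, the function names (`select_visual_tokens`, `decoupled_sparsity`), and the keep ratios are all illustrative assumptions: prefill retains a moderate fraction of visual tokens to preserve cross-modal grounding, while decoding keeps a smaller subset so every generated reasoning token attends over a reduced KV cache.

```python
# Minimal sketch of decoupled visual-token sparsity (assumed, not from the paper).
# Importance scores stand in for whatever criterion ranks visual tokens
# (e.g., query-to-visual attention); ratios are placeholder values.
import torch


def select_visual_tokens(importance: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Return sorted indices of the top `keep_ratio` fraction of visual tokens."""
    num_keep = max(1, int(importance.numel() * keep_ratio))
    return torch.topk(importance, num_keep).indices.sort().values


def decoupled_sparsity(
    visual_tokens: torch.Tensor,   # [num_visual, dim]
    importance: torch.Tensor,      # [num_visual] importance score per token
    prefill_keep: float = 0.5,     # assumed prefill keep ratio
    decode_keep: float = 0.25,     # assumed (more aggressive) decode keep ratio
):
    # Prefill: moderate pruning keeps enough visual context for grounding.
    prefill_idx = select_visual_tokens(importance, prefill_keep)
    prefill_tokens = visual_tokens[prefill_idx]

    # Decoding: heavier pruning shrinks the visual KV cache that every
    # generated token attends to, which dominates latency for long
    # reasoning traces.
    decode_idx = select_visual_tokens(importance, decode_keep)
    decode_tokens = visual_tokens[decode_idx]
    return prefill_tokens, decode_tokens


if __name__ == "__main__":
    vis = torch.randn(576, 1024)   # e.g., one image encoded as 576 patch tokens
    score = torch.rand(576)        # stand-in importance scores
    pre, dec = decoupled_sparsity(vis, score)
    print(pre.shape, dec.shape)    # [288, 1024] for prefill, [144, 1024] for decode
```

Because the two stages use independent keep ratios, the decode-side budget can be tuned for token throughput without also degrading the prefill context used for grounding.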
Submission Number: 292