Limits and Gains of Test-Time Scaling in Vision-Language Reasoning

Limits and Gains of Test-Time Scaling in Vision-Language Reasoning

24 Feb 2026 (modified: 11 Mar 2026)Under review for TMLREveryoneRevisionsBibTeXCC BY 4.0

Abstract: Test-time scaling (TTS) has emerged as a powerful paradigm for improving the reasoning ability of Large Language Models (LLMs) by allocating additional computation at inference, yet its application to multimodal systems such as Vision-Language Models (VLMs) remains underexplored. In this work, we present a systematic empirical study of inference-time reasoning methods applied across both open-source and closed-source VLMs on different benchmarks. Our results reveal that while closed-source models consistently benefit from structured reasoning and iterative Self-Refinement, open-source VLMs show inconsistent behavior: external verification provides the most reliable gains, whereas iterative refinement often degrades performance. We further find that the effectiveness of TTS is dataset-dependent, yielding clear improvements on multi-step reasoning tasks but offering only limited gains on perception-focused benchmarks. These findings demonstrate that TTS is not a universal solution and must be tailored to both model capabilities and task characteristics, motivating future work on adaptive TTS strategies and multimodal reward models.

Submission Type: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Jose_Dolz1

Submission Number: 7658

Loading