Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned
Keywords: Multimodal Process Reward Model
Abstract: Process Reward Models (PRMs) provide step-level supervision that improves the reliability of reasoning in large language models. While PRMs have been extensively studied in text-based domains, their extension to Vision-Language Models (VLMs) remains limited. Existing multimodal PRMs typically rely on Monte Carlo Tree Search (MCTS) for data construction, which often produces noisy supervision signals and limits generalization across tasks. In this work, we introduce VL-PRM, a Process Reward Model tailored for multimodal reasoning that expands the design space of multimodal PRMs by exploring diverse strategies for dataset construction, training, and test-time scaling. First, we introduce a hybrid data synthesis framework that combines MCTS with judgments from a strong multimodal LLM, producing more accurate step-level labels. Second, we propose perception-focused supervision, enabling our PRM to explicitly detect errors at the visual grounding stage of reasoning. Third, we systematically evaluate multiple test-time scaling strategies, showing that our PRMs can reliably guide VLMs toward more accurate solutions. Our experiments cover five diverse multimodal benchmarks (MMMU, PuzzleVQA, AlgoPuzzleVQA, MathVista, and MathVision) and reveal several key insights: (i) smaller VL-PRMs can match or even surpass larger ones in detecting process errors, (ii) VL-PRMs uncover latent reasoning abilities in stronger VLM backbones, and (iii) perception-level supervision yields significant gains in test-time scaling. Together, these findings demonstrate that VL-PRMs not only reduce hallucinations but also enhance the general reasoning capabilities of VLMs, offering a lightweight yet powerful intervention for multimodal reasoning.
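To make the test-time scaling setup concrete, below is a minimal sketch of PRM-guided best-of-N selection, one common strategy of the kind the abstract refers to. The helpers `generate_candidates` and `score_steps` are hypothetical stand-ins for a VLM sampler and a VL-PRM step scorer; the paper's actual strategies and aggregation rules may differ.

```python
# Hedged sketch: best-of-N test-time scaling with a step-level PRM.
# `generate_candidates` and `score_steps` are assumed interfaces, not the paper's API.
from typing import Callable, List


def best_of_n(
    question: str,
    image: object,
    generate_candidates: Callable[[str, object, int], List[List[str]]],
    score_steps: Callable[[str, object, List[str]], List[float]],
    n: int = 8,
) -> List[str]:
    """Sample N candidate reasoning chains and return the one the PRM prefers."""
    # Each candidate is a list of reasoning steps produced by the VLM.
    candidates = generate_candidates(question, image, n)

    def chain_score(steps: List[str]) -> float:
        # Aggregate step-level PRM scores; min-pooling penalizes any single bad step.
        step_scores = score_steps(question, image, steps)
        return min(step_scores)

    return max(candidates, key=chain_score)
```

Min-pooling is only one plausible aggregation choice here; mean or last-step scoring are equally simple alternatives under this sketch.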
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 18195