Keywords: Multimodal Reasoning, Process Reward Model, Monte Carlo Tree Search
TL;DR: We present MM-PRM, a unified and scalable framework for training multimodal Process Reward Models with MCTS-generated step-level supervision, significantly improving mathematical multi-step reasoning performance without manual annotations.
Abstract: While Multimodal Large Language Models (MLLMs) have achieved impressive progress in vision-language understanding, they still struggle with complex multi-step reasoning. A key limitation lies in the lack of fine-grained supervision over intermediate reasoning steps. To address this, we propose MM-PRM: a unified, scalable framework for building Process Reward Models (PRMs) in multimodal settings. We first build MM-Policy-8B, a strong multimodal policy model trained on diverse mathematical reasoning data. Then, we construct MM-K12, a curated dataset of 10,000 multimodal math problems with verifiable answers, which serves as seed data. Leveraging a Monte Carlo Tree Search (MCTS)-based pipeline, we generate over 700k step-level annotations without human labeling. The resulting MM-PRM-8B is used to rerank candidate reasoning paths and achieves significant improvements across both in-domain and out-of-domain benchmarks. MM-PRM demonstrates that process supervision is a powerful tool for enhancing the logical robustness of multimodal reasoning systems. We release all our code and data at https://anonymous.4open.science/r/MM-PRM-F608/.
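To make the reranking step described in the abstract concrete, below is a minimal sketch of how a PRM can be used to select among candidate reasoning paths. The function names (`rerank_with_prm`, `prm_score`) and the min-over-steps aggregation rule are illustrative assumptions, not necessarily the paper's exact implementation.

```python
# Minimal sketch of PRM-based reranking of candidate reasoning paths.
# `prm_score` is a hypothetical stand-in for MM-PRM-8B's step-level scorer;
# aggregating with min() over step scores is an assumption (it penalizes
# any single weak step), not necessarily the paper's choice.
from typing import Callable, List


def rerank_with_prm(
    candidates: List[List[str]],                   # each candidate is a list of reasoning steps
    prm_score: Callable[[List[str], int], float],  # score of step i given the preceding steps
) -> List[str]:
    """Return the candidate path whose weakest step is rated highest by the PRM."""
    def path_score(steps: List[str]) -> float:
        return min(prm_score(steps, i) for i in range(len(steps)))

    return max(candidates, key=path_score)


# Usage: sample N solutions from the policy model, then pick one, e.g.
# best = rerank_with_prm(sampled_solutions, prm_score=my_prm.score_step)
```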
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 15163