OraclePRM: Unlocking the Potential of Each Instance for Multimodal Process Reward Model Training

09 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: Multimodal Reasoning, Process Reward Model, Bi-level Optimization, Instance-reweighting
Abstract: Training multimodal process reward models (PRMs) is hard due to (i) distribution shift between the training and test sets and (ii) quality imbalance across training samples. While domain-level reweighting (e.g., DreamPRM) aligns training with test-time objectives, it leaves a clear gap to an oracle upper bound (pass@N), even under a “sanity check” that uses test set data to probe headroom, which points to under-parameterization at the meta level. We introduce OraclePRM, an instance-level reweighting framework that assigns an adaptive weight to every training example via bi-level optimization. To realize instance reweighting across data scales, we develop two complementary regimes: Instance Table, which learns explicit per-sample weights and excels on small and medium-sized datasets, and Instance Net, a lightweight neural network that generalizes better and scales to large corpora. A practical, stable training recipe (time-scale matching between upper- and lower-level updates, cold-start initialization, and bounded-range weights) prevents divergence. Integrated with test-time scaling, OraclePRM attains 84.6 accuracy on the MMMU validation set and, when paired with a leading backbone (e.g., GPT-5-mini), achieves first-place results on public multimodal reasoning leaderboards. Extensive experiments, including benchmark evaluations, baseline comparisons, and the sanity check, demonstrate that OraclePRM closes the gap toward the oracle, achieves leading performance, and trains stably.
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 3522
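
The abstract's idea of learning one bounded weight per training example through bi-level optimization can be illustrated with a toy sketch. This is not the authors' implementation: it assumes a 1-D regression problem in place of PRM training, a one-step lookahead approximation of the upper-level gradient, and a sigmoid to keep weights in a bounded range. All names (`train_instance_table`, `logits`, `meta_lr`) and hyperparameters are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_instance_table(steps=200, lr=0.1, meta_lr=50.0):
    """Toy per-sample reweighting on y = 2x with some flipped labels (assumed setup)."""
    n = 100
    x = np.linspace(0.5, 1.5, n)
    is_corrupt = (np.arange(n) % 5 == 0)          # every 5th label is flipped
    y = np.where(is_corrupt, -2.0 * x, 2.0 * x)

    # Small trusted meta set that stands in for the test-time objective.
    xm = np.linspace(0.5, 1.5, 20)
    ym = 2.0 * xm

    theta = 0.0                # lower-level model parameter
    logits = np.zeros(n)       # explicit table: one logit per training sample

    for _ in range(steps):
        w = sigmoid(logits)                        # bounded weights in (0, 1)
        g_i = (x * theta - y) * x                  # per-sample gradients of 0.5 * residual^2

        # Lookahead: one weighted gradient step on the training loss.
        theta_look = theta - lr * np.mean(w * g_i)

        # Upper level: gradient of the meta loss at the lookahead parameter.
        g_meta = np.mean((xm * theta_look - ym) * xm)

        # Chain rule through the lookahead step: d(meta)/d(w_i) = -(lr / n) * g_i * g_meta.
        grad_w = -(lr / n) * g_i * g_meta
        logits -= meta_lr * grad_w * w * (1.0 - w)  # sigmoid derivative is w * (1 - w)

        # Lower level: update the model with the refreshed weights.
        theta = theta - lr * np.mean(sigmoid(logits) * g_i)

    return theta, sigmoid(logits), is_corrupt

theta, weights, is_corrupt = train_instance_table()
print("theta:", theta)
print("mean weight (clean):", weights[~is_corrupt].mean())
print("mean weight (flipped):", weights[is_corrupt].mean())
```

In this sketch, samples whose gradients agree with the meta-set gradient gain weight and samples that oppose it lose weight. A network-based variant in the spirit of Instance Net would replace the `logits` table with a small model that maps per-sample features to a weight, so the weights can generalize to unseen examples.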