Preference-Based Process Reward Model for Robust Mathematical Reasoning

ICLR 2026 Conference Submission 23801 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Process Reward Model, Reinforcement Learning, Monte Carlo Tree Search
Abstract: Process reward models (PRMs) have emerged as a promising approach to guiding LLMs by providing step-wise supervision, but traditional methods often rely on heuristic search strategies such as Monte Carlo Tree Search (MCTS), which introduce bias and limit generalization. In this work, we propose a reinforcement learning framework guided by a Preference-Based Process Reward Model (PPRM), which provides step-wise supervision to refine reasoning trajectories. We first employ MCTS to estimate step quality and select chosen and rejected rollouts, thereby constructing a high-quality step-level preference dataset. The PPRM is trained with a Bradley-Terry loss, leveraging preference-based learning to mitigate the bias introduced by MCTS's heuristic search. To enable effective RL training with the PPRM, we enhance Group Relative Policy Optimization (GRPO) with a robust advantage estimator that better captures the structure of preference-based process rewards, enabling stable and efficient policy optimization. Experimental results on ProcessBench and under a best-of-n selection strategy show that our approach achieves a $2$-$3\%$ improvement in intermediate-step accuracy over existing methods on complex reasoning processes, thereby improving the reasoning accuracy of the policy model across several key reasoning benchmarks.
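For concreteness, the step-level preference objective presumably takes the standard Bradley-Terry pairwise form; the following is a minimal sketch using assumed notation rather than the paper's exact formulation. With $r_\theta(s, a)$ denoting the PPRM score for a reasoning step $a$ in context $s$, and $(a^+, a^-)$ the chosen and rejected steps selected via MCTS rollouts,

$$\mathcal{L}_{\text{PPRM}}(\theta) = -\,\mathbb{E}_{(s,\, a^+,\, a^-)}\left[\log \sigma\big(r_\theta(s, a^+) - r_\theta(s, a^-)\big)\right],$$

where $\sigma$ is the logistic function. Training on relative comparisons between rollouts, rather than on absolute MCTS value estimates, is what lets the reward model avoid inheriting the search bias directly.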
Primary Area: foundation or frontier models, including LLMs
Submission Number: 23801