GroundedPRM: Tree-Guided and Fidelity-Aware Process Reward Modeling for Step-Level Reasoning

ACL ARR 2026 January Submission 5039 Authors

05 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: process reward model, step-level supervision, mathematical reasoning, Monte Carlo tree search, tool-augmented verification, reward modeling, test-time search
Abstract: Process Reward Models (PRMs) supervise multi-step reasoning in Large Language Models (LLMs) by evaluating intermediate steps rather than only final outcomes. However, training effective PRMs remains challenging: human annotations are expensive, LLM-based evaluators hallucinate, and Monte Carlo (MC)-based supervision infers step quality solely from rollout outcomes, resulting in noisy rewards and significant credit misattribution. These limitations lead to low factual fidelity, imprecise step-level supervision, and misalignment with true reasoning objectives. We introduce GroundedPRM, a tree-guided and fidelity-aware framework for automatic process supervision. GroundedPRM constructs structured reasoning paths via Monte Carlo Tree Search (MCTS), providing more precise and context-informed credit signals than flat Monte Carlo rollouts. Each intermediate step is then validated using an external execution tool, producing accurate, execution-grounded correctness labels. To combine step-level fidelity with global reasoning quality, we design a hybrid reward that integrates local tool-based validation with global MCTS-derived consistency feedback. The final supervision signal is formatted in a rationale-enhanced structure that improves interpretability while remaining compatible with instruction-tuned LLMs. Trained on only 40K automatically labeled samples (just 10% of the data used to train the strongest PRM built on auto-labeled data), GroundedPRM achieves up to a 26% relative improvement on ProcessBench. When used for reward-guided greedy search, it even outperforms PRMs trained with human-labeled data. GroundedPRM provides an automatic, fidelity-driven, and interpretable approach to process-level reasoning in LLMs.
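The abstract describes a hybrid reward that blends a local, tool-verified correctness signal with a global MCTS-derived consistency value. A minimal sketch of one such combination is below; the function name, the linear interpolation form, and the weight `alpha` are illustrative assumptions, not details taken from the paper.

```python
def hybrid_step_reward(tool_verified: bool, mcts_value: float, alpha: float = 0.5) -> float:
    """Blend local execution-grounded fidelity with global search consistency.

    tool_verified: whether an external execution tool confirmed this step.
    mcts_value:    backed-up value of the step's node in the search tree, in [0, 1].
    alpha:         interpolation weight between the local and global signals
                   (hypothetical; the paper's actual combination may differ).
    """
    local = 1.0 if tool_verified else 0.0
    return alpha * local + (1.0 - alpha) * mcts_value

# Example: a tool-verified step with strong search support scores highly,
# while an unverified step is penalized even if the search favors it.
high = hybrid_step_reward(True, 0.8)   # 0.5*1.0 + 0.5*0.8 = 0.9
low = hybrid_step_reward(False, 0.8)   # 0.5*0.0 + 0.5*0.8 = 0.4
```

The intuition this sketch captures is the paper's stated motivation: outcome-only MC rollouts can reward a wrong step that happens to precede a correct answer, whereas the tool-based term anchors the reward to verified step-level correctness.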
Paper Type: Long
Research Area: Mathematical, Symbolic, Neurosymbolic, and Logical Reasoning
Research Area Keywords: mathematical reasoning, step-by-step reasoning, process supervision, reward modeling, symbolic verification
Contribution Types: NLP engineering experiment, Data analysis
Languages Studied: English
Submission Number: 5039