R-PRM: Reasoning-Driven Process Reward Modeling

ACL ARR 2025 February Submission 8248 Authors

16 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract: Large language models (LLMs) inevitably make mistakes when performing step-by-step mathematical reasoning. Process Reward Models (PRMs) have emerged as a promising solution by evaluating each reasoning step. However, existing PRMs typically output evaluation scores directly, limiting both learning efficiency and evaluation accuracy, problems that are further exacerbated by the scarcity of annotated data. To address these issues, we propose Reasoning-Driven Process Reward Modeling (R-PRM). First, we leverage stronger LLMs to generate seed data from limited annotations, effectively bootstrapping our model's reasoning capabilities and enabling comprehensive step-by-step evaluation. Second, we further enhance performance through preference optimization, without requiring additional annotated data. Third, we introduce inference-time scaling to fully harness the model's reasoning potential. Extensive experiments demonstrate R-PRM's effectiveness: on ProcessBench and PRMBench, it surpasses strong baselines by 11.9 and 8.5 F1 points, respectively. When applied to guide mathematical reasoning, R-PRM achieves consistent accuracy improvements of over 8.5 points across six challenging datasets. Further analysis reveals that R-PRM exhibits better evaluation comprehensiveness and generalization capabilities, providing additional performance gains and underscoring its potential.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Reasoning, Process Reward Model
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 8248
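
The sketch below is an editor-added illustration, not the authors' released code, of the general idea described in the abstract: a reasoning-driven PRM produces a free-form critique of each reasoning step before committing to a verdict, and inference-time scaling samples several critiques per step and aggregates their verdicts. The function names (`judge_step`, `generate_critique`, `dummy_critique`) and the prompt format are assumptions made for this example only.

```python
# Minimal sketch of reasoning-driven step evaluation with inference-time scaling.
# `generate_critique` stands in for a call to a trained process reward model;
# it is an assumption, not an API defined by the paper.
import random
from typing import Callable, List


def judge_step(question: str, steps: List[str], idx: int,
               generate_critique: Callable[[str], str], k: int = 8) -> float:
    """Estimate the probability that step `idx` is correct by sampling `k`
    critiques and majority-voting over their final verdicts."""
    prompt = (
        f"Question: {question}\n"
        "Steps so far:\n" + "\n".join(steps[: idx + 1]) + "\n"
        f"Analyze step {idx + 1} and end with 'Verdict: correct' or 'Verdict: incorrect'."
    )
    votes = []
    for _ in range(k):
        critique = generate_critique(prompt)
        votes.append(1.0 if "verdict: correct" in critique.lower() else 0.0)
    return sum(votes) / len(votes)


def dummy_critique(prompt: str) -> str:
    # Toy stand-in for an LLM call, only to make the sketch executable.
    return random.choice(["... Verdict: correct", "... Verdict: incorrect"])


if __name__ == "__main__":
    steps = ["Let x = 3.", "Then 2x = 6.", "So 2x + 1 = 8."]  # last step is wrong
    for i in range(len(steps)):
        score = judge_step("Compute 2x + 1 for x = 3.", steps, i, dummy_critique)
        print(f"step {i + 1}: estimated correctness = {score:.2f}")
```

In practice the per-step scores from such a judge could be used to rerank or guide candidate solutions; the dummy critique function here is purely illustrative.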