VersaPRM: Multi-Domain Process Reward Model via Synthetic Reasoning Data

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 oral · CC BY 4.0
TL;DR: Math PRMs show limited generalizability beyond math. The fix: further train on a synthetically generated multi-domain CoT dataset.
Abstract: Process Reward Models (PRMs) have proven effective at enhancing mathematical reasoning for Large Language Models (LLMs) by leveraging increased inference-time computation. However, they are predominantly trained on mathematical data, and their generalizability to non-mathematical domains has not been rigorously studied. In response, this work first shows that current PRMs perform poorly in other domains. To address this limitation, we introduce ***VersaPRM***, a multi-domain PRM trained on synthetic reasoning data generated using our novel data generation and annotation method. VersaPRM achieves consistent performance gains across diverse domains. For instance, in the MMLU-Pro category of Law, VersaPRM, via weighted majority voting, achieves a 7.9% performance gain over the majority voting baseline, surpassing Qwen2.5-Math-PRM's gain of 1.3%. We further contribute to the community by open-sourcing all data, code, and models for VersaPRM.
Lay Summary: Recent advances have shown that Large Language Models (like ChatGPT) solve math problems more effectively if they explain their thinking process step by step before giving a final answer. This ability can be further improved with Process Reward Models—models that check and grade each step of the reasoning process for correctness. However, previous work on Process Reward Models has mostly focused on math problems. We find that these models don’t perform well on questions from other areas, such as Law or Biology. To address this, we introduce a new Process Reward Model called VersaPRM, which is trained on a more diverse set of reasoning tasks. As a result, VersaPRM can help Large Language Models reason better across a wider range of subjects—not just math.
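The weighted majority voting mentioned in the abstract can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: it assumes each sampled chain of thought has per-step PRM scores, aggregated here by taking the minimum step score (one common choice; product or mean also appear in the literature). All names and data are hypothetical.

```python
from collections import defaultdict

def majority_vote(answers):
    # Plain majority voting: pick the most frequent final answer.
    counts = defaultdict(int)
    for a in answers:
        counts[a] += 1
    return max(counts, key=counts.get)

def weighted_majority_vote(answers, step_scores):
    # PRM-weighted majority voting: each sampled chain of thought
    # contributes a weight equal to an aggregate of its per-step PRM
    # scores (here, the minimum step score).
    weights = defaultdict(float)
    for answer, scores in zip(answers, step_scores):
        weights[answer] += min(scores)
    return max(weights, key=weights.get)

# Hypothetical example: five sampled chains of thought for one question.
answers = ["C", "C", "C", "B", "B"]
step_scores = [
    [0.4, 0.3, 0.5],   # three weakly scored chains ending in "C"
    [0.5, 0.2, 0.6],
    [0.3, 0.1, 0.4],
    [0.9, 0.95, 0.9],  # two strongly scored chains ending in "B"
    [0.8, 0.9, 0.85],
]
print(majority_vote(answers))                        # "C" wins on count alone
print(weighted_majority_vote(answers, step_scores))  # "B" wins once PRM scores weigh in
```

The intuition matches the paper's result: a PRM that scores reasoning steps well can overturn a numerically popular but poorly reasoned answer, which is why the quality of the PRM outside math matters.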
Primary Area: Deep Learning->Large Language Models
Keywords: Process Reward Model
Submission Number: 409