ProofRM: A Scalable Pipeline to Train a Generalized Math Proof Reward Model

Haotong Yang; Zitong Wang; Shijia Kang; Siqi Yang; Wenkai Yu; Xu Niu; Yike Sun; Yi Hu; Zhouchen Lin; Muhan Zhang

16 Sept 2025 (modified: 11 Feb 2026). Submitted to ICLR 2026. License: CC BY 4.0

Keywords: math proof, reward model

TL;DR: We propose a scalable pipeline for training a generalized reward model for math proofs.

Abstract: Large Language Models (LLMs) have developed strong mathematical reasoning abilities through Reinforcement Learning with Verifiable Rewards (RLVR). Because final answers can be compared directly, the reward is both accurate and scalable. However, more challenging mathematical problems, such as math Olympiad problems or genuine mathematical research, often take the form of proof-based problems, where the validity of a proof cannot be determined by simply matching answers. Scalable and efficient reinforcement learning on these problems therefore requires a reward model that can accurately evaluate diverse, complete proofs. In this paper, we design a *scalable* data construction process that, with minimal human involvement, leverages LLMs to generate a large quantity of high-quality and diverse "*problem-proof-check*" triplet data. This data can be used to train the proof reward model through RLVR by rewarding correct proof checking. By varying the proof-generating LLMs, proof generation methods, prompts, and problem sources, we ensure that the generated problem-proof pairs are diverse in difficulty, length, and language style. Our human checks also support the high accuracy of the checking labels. With this data generation process, we train a proof evaluator that judges proofs accurately across diverse datasets. Our experiments, which compare against prior baselines, validate the model's effectiveness from multiple perspectives, including reward accuracy, reinforcement learning effectiveness, and test-time guidance, and provide process references and tools for enhancing LLMs' mathematical capabilities.
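As a rough illustration of what "rewarding correct proof checking" could look like, the sketch below scores a checker's output against the label of a problem-proof-check triplet. It is not the authors' implementation: the `ProofTriplet` record, the `Verdict: correct` / `Verdict: incorrect` output format, and the binary 0/1 reward are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ProofTriplet:
    """One hypothetical 'problem-proof-check' record (field names are assumed)."""
    problem: str
    proof: str
    is_correct: bool  # the check label: whether the proof is judged valid


def parse_verdict(checker_output: str) -> Optional[bool]:
    """Read the verdict from the last non-empty line.

    Assumed format: the checker ends its response with
    'Verdict: correct' or 'Verdict: incorrect'.
    """
    lines = [ln.strip().lower() for ln in checker_output.splitlines() if ln.strip()]
    if not lines:
        return None
    if lines[-1] == "verdict: correct":
        return True
    if lines[-1] == "verdict: incorrect":
        return False
    return None  # malformed or missing verdict


def checking_reward(triplet: ProofTriplet, checker_output: str) -> float:
    """Binary verifiable reward: 1.0 when the verdict matches the label."""
    verdict = parse_verdict(checker_output)
    if verdict is None:
        return 0.0  # unparseable outputs earn no reward
    return 1.0 if verdict == triplet.is_correct else 0.0
```

Under these assumptions, a checker that flags a flawed proof as incorrect earns the reward, while a checker that accepts it, or that emits no parseable verdict, earns none. Because the verdict is compared against a stored label, the reward stays verifiable in the same sense as answer matching.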

Supplementary Material: zip

Primary Area: foundation or frontier models, including LLMs

Submission Number: 7625
