Learning Discriminative Process Reward Models without Step Labels

Published: 03 Mar 2026 · Last Modified: 03 Mar 2026 · SPOT · CC BY 4.0
Keywords: Process Reward Model; LLM Reasoning
Abstract: Process reward models (PRMs) can improve LLM reasoning by providing step-level feedback, but training them typically depends on costly step annotations, which limits scalability. Final outcome labels, in contrast, are inexpensive to obtain; however, existing outcome-supervised methods either learn outcome-only reward models that offer no stepwise guidance, or adopt implicit reward formulations that increase inference cost and are susceptible to reward hacking. To address these limitations, we propose a new framework for learning discriminative PRMs using only outcome labels. Our approach treats step quality as a latent variable and connects it to the outcome via an aggregation function that emphasizes low-scoring steps (e.g., the geometric mean). This enables end-to-end training by backpropagating the outcome loss through the aggregator to step-level scores. The resulting classifier-style PRMs are efficient at inference and achieve competitive performance on challenging reasoning tasks, both for test-time search and on PRM benchmarks.
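To make the training signal concrete, here is a minimal PyTorch sketch of the idea the abstract describes: per-step scores are treated as latent probabilities, aggregated by a geometric mean, and trained against the outcome label with a binary cross-entropy loss. The function and variable names (`geometric_mean`, `outcome_loss`, `step_logits`) are illustrative assumptions, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def geometric_mean(step_probs: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Geometric mean over steps. Because a single low score drags the
    # product down, the outcome loss concentrates gradient on weak steps.
    return torch.exp(torch.log(step_probs.clamp_min(eps)).mean(dim=-1))

def outcome_loss(step_logits: torch.Tensor, outcome: torch.Tensor) -> torch.Tensor:
    # step_logits: (batch, num_steps) raw scores from the PRM head
    # outcome:     (batch,) 1.0 if the final answer is correct, else 0.0
    step_probs = torch.sigmoid(step_logits)   # latent step-quality scores
    traj_prob = geometric_mean(step_probs)    # aggregate to a trajectory score
    return F.binary_cross_entropy(traj_prob, outcome)

# Toy usage: gradients reach every step score through the aggregator,
# so no step-level labels are needed.
logits = torch.randn(4, 6, requires_grad=True)
loss = outcome_loss(logits, torch.tensor([1.0, 0.0, 1.0, 0.0]))
loss.backward()
print(loss.item(), logits.grad.shape)
```

At inference, such a model scores each step directly from `torch.sigmoid(step_logits)` in a single forward pass, which is what makes the classifier-style PRM cheap compared to implicit-reward formulations.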
Submission Number: 82