OC-PRM: Overcredit-Contrastive Training for Precision-First Process Reward Models

Published: 02 Mar 2026, Last Modified: 14 Apr 2026 · AFAA 2026 Poster · CC BY 4.0
Track: Main Papers Track (6 to 9 pages)
Keywords: Reward Hacking, Process Reward Model, Policy Misalignment
TL;DR: False positives fundamentally limit PRM-guided reasoning; we propose a comparison-based training method that requires no new human labels, reduces overcrediting, and yields more reliable process supervision and better downstream accuracy.
Abstract: Process reward models (PRMs) offer step-level supervision for reasoning LLMs, but in practice they often \emph{overcredit} incorrect steps, inducing high false positive rates that distort decoding and compound over long chains. We show analytically that in Best-of-$N$ selection, false positives impose an asymptotic alignment ceiling (set by the PRM's precision), whereas false negatives primarily increase sample complexity and slow convergence. Motivated by this asymmetry, we introduce a label-efficient training recipe that requires no new human annotation: we convert existing step labels into matched positive-negative comparisons, optimize a novel \emph{Overcredit Contrastive (OC)} objective, and rebalance supervision using lightweight negative augmentation and a simple difficulty curriculum. On PRMBench~\citep{song2025prmbench}, our method sharply reduces false positives and improves macro F1 over strong discriminative and generative PRMs. When deployed for guided beam search and Best-of-$N$ selection, the resulting PRMs yield higher downstream task accuracy and improved robustness. Overall, our results suggest that comparison-centered training with balanced step data provides a practical path to trustworthy process supervision without additional human labels.
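The abstract's ceiling claim can be made concrete with a simple thresholded-verifier model. The sketch below is our own illustration, not the paper's analysis; the symbols $p$, $\alpha$, and $\beta$ are assumed notation.

```latex
% Illustrative model (assumed notation): each of N sampled traces is
% correct with probability p; the PRM accepts an incorrect trace with
% probability alpha (false positive) and a correct trace with
% probability 1 - beta; Best-of-N picks uniformly among accepted traces.
\[
\Pr[\text{selected trace is correct}]
  \;\xrightarrow[N\to\infty]{}\;
  \frac{p\,(1-\beta)}{p\,(1-\beta) + (1-p)\,\alpha},
\]
% i.e. the PRM's precision over the candidate distribution. Setting
% alpha = 0 sends the limit to 1 regardless of beta: false negatives
% (beta > 0) only shrink the pool of accepted correct traces, raising
% the N needed to converge rather than lowering the ceiling.
```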
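The paper's Overcredit Contrastive objective is not specified on this page; as a minimal sketch consistent with the abstract's description, a margin-based pairwise loss over matched positive-negative step comparisons could look like the following. The function name, the `prm(prefix, step)` interface, and all variable names are our assumptions.

```python
import torch
import torch.nn.functional as F

def overcredit_contrastive_loss(pos_scores: torch.Tensor,
                                neg_scores: torch.Tensor,
                                margin: float = 1.0) -> torch.Tensor:
    """Pairwise loss over matched (correct step, incorrect step) pairs
    that share the same reasoning prefix. Pushing each negative's score
    at least `margin` below its matched positive's penalizes
    overcrediting directly, instead of fitting absolute step labels.
    """
    # pos_scores / neg_scores: shape (B,), PRM scalar scores for the
    # correct and incorrect continuation of the same prefix.
    return F.relu(margin - (pos_scores - neg_scores)).mean()

# Usage sketch (hypothetical PRM interface returning scalar scores):
#   pos = prm(prefixes, good_steps)
#   neg = prm(prefixes, bad_steps)
#   loss = overcredit_contrastive_loss(pos, neg)
```

Pairing each negative with a positive from the same prefix keeps the comparison on-policy for that prefix, which is one plausible reading of the abstract's "matched positive-negative comparisons".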
Submission Number: 47