OC-PRM: Overcredit-Contrastive Training for Precision-First Process Reward Models

Published: 02 Mar 2026, Last Modified: 14 Apr 2026 · AFAA 2026 Poster · CC BY 4.0
Track: Main Papers Track (6 to 9 pages)
Keywords: Reward Hacking, Process Reward Model, Policy Misalignment
TL;DR: False positives fundamentally limit PRM-guided reasoning; we propose a comparison-based training method that requires no new human labels, reduces overcrediting, and yields more reliable process supervision and better downstream accuracy.
Abstract: Process reward models (PRMs) offer step-level supervision for reasoning LLMs, but in practice they often \emph{overcredit} incorrect steps, inducing high false positive rates that distort decoding and compound over long chains. We show analytically that in Best-of-$N$ selection, false positives impose an asymptotic alignment ceiling (set by the PRM's precision), whereas false negatives primarily increase sample complexity and slow convergence. Motivated by this asymmetry, we introduce a label-efficient training recipe that requires no new human annotation: we convert existing step labels into matched positive-negative comparisons, optimize a novel \emph{Overcredit Contrastive (OC)} objective, and rebalance supervision using lightweight negative augmentation and a simple difficulty curriculum. On PRMBench~\citep{song2025prmbench}, our method sharply reduces false positives and improves macro F1 over strong discriminative and generative PRMs. When deployed for guided beam search and Best-of-$N$ selection, the resulting PRMs yield higher downstream task accuracy and improved robustness. Overall, our results suggest that comparison-centered training with balanced step data provides a practical path to trustworthy process supervision without additional human labels.
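The abstract's ceiling claim can be made concrete with a simple thresholded-verifier model. The sketch below is our own illustration, not the paper's analysis; the symbols $p$, $\alpha$, and $\beta$ are assumed notation.

```latex
% Illustrative model (assumed notation): each of N sampled traces is
% correct with probability p; the PRM accepts an incorrect trace with
% probability alpha (false positive) and a correct trace with
% probability 1 - beta; Best-of-N picks uniformly among accepted traces.
\[
\Pr[\text{selected trace is correct}]
  \;\xrightarrow[N\to\infty]{}\;
  \frac{p\,(1-\beta)}{p\,(1-\beta) + (1-p)\,\alpha},
\]
% i.e. the PRM's precision over the candidate distribution. Setting
% alpha = 0 sends the limit to 1 regardless of beta: false negatives
% (beta > 0) only shrink the pool of accepted correct traces, raising
% the N needed to converge rather than lowering the ceiling.
```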
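The paper's Overcredit Contrastive objective is not specified on this page; as a minimal sketch consistent with the abstract's description, a margin-based pairwise loss over matched positive-negative step comparisons could look like the following. The function name, the `prm(prefix, step)` interface, and all variable names are our assumptions.

```python
import torch
import torch.nn.functional as F

def overcredit_contrastive_loss(pos_scores: torch.Tensor,
                                neg_scores: torch.Tensor,
                                margin: float = 1.0) -> torch.Tensor:
    """Pairwise loss over matched (correct step, incorrect step) pairs
    that share the same reasoning prefix. Pushing each negative's score
    at least `margin` below its matched positive's penalizes
    overcrediting directly, instead of fitting absolute step labels.
    """
    # pos_scores / neg_scores: shape (B,), PRM scalar scores for the
    # correct and incorrect continuation of the same prefix.
    return F.relu(margin - (pos_scores - neg_scores)).mean()

# Usage sketch (hypothetical PRM interface returning scalar scores):
#   pos = prm(prefixes, good_steps)
#   neg = prm(prefixes, bad_steps)
#   loss = overcredit_contrastive_loss(pos, neg)
```

Pairing each negative with a positive from the same prefix keeps the comparison on-policy for that prefix, which is one plausible reading of the abstract's "matched positive-negative comparisons".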
Submission Number: 47