Cut the Overcredit: Precision First Process Rewards for Reasoning LLMs

ICLR 2026 Conference Submission 18670 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · Everyone · Revisions · BibTeX · CC BY 4.0
Keywords: Reward Hacking, Reasoning LLMs, Process Reward Model, False Positive Bias
Abstract: Process reward models (PRMs) supply step-level supervision for reasoning LLMs but often overcredit incorrect steps, producing high false-positive rates that misdirect decoding and accumulate across long chains. We show analytically that false positives impose an asymptotic ceiling on Best-of-N alignment, whereas false negatives mainly slow convergence. To mitigate this, we introduce a label-efficient recipe: convert existing step annotations into positive–negative pairs, train with a novel Overcredit Contrastive (OC) loss, and rebalance the training data with lightweight negative augmentation and a simple curriculum. On PRMBench, our approach substantially lowers false positives and improves macro F1 over strong discriminative and generative PRMs. When used for guided beam search and Best-of-N selection, the resulting PRMs deliver higher downstream accuracy and robustness. Our results indicate that comparison-centered training on balanced step data is a practical path to trustworthy process supervision without new human labels.
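
A simplified way to see the flavor of the ceiling claim (this toy binary-score model is our own illustration, not the paper's derivation): suppose each sampled solution is correct with probability $p$, and a PRM accepts correct solutions with true-positive rate $t$ and incorrect ones with false-positive rate $f$, with Best-of-N choosing uniformly among accepted candidates. By the law of large numbers,

$$\lim_{N\to\infty} \Pr[\text{selected solution is correct}] \;=\; \frac{p\,t}{p\,t + (1-p)\,f},$$

which is strictly below $1$ whenever $f>0$; by contrast, if $f=0$ then any $t>0$ still drives accuracy to $1$, and a smaller $t$ (more false negatives) only slows the rate $1-(1-p\,t)^{N}$ at which an accepted correct candidate appears.

The abstract's recipe trains on positive–negative step pairs with a contrastive objective. The sketch below is a generic step-level pairwise loss of that kind; the function name, margin, and loss form are illustrative assumptions, not the paper's OC loss.

```python
import torch
import torch.nn.functional as F

def pairwise_step_loss(pos_scores: torch.Tensor,
                       neg_scores: torch.Tensor,
                       margin: float = 1.0) -> torch.Tensor:
    """Illustrative pairwise objective over (correct, incorrect) step pairs.

    pos_scores / neg_scores: PRM scores for matched correct / incorrect steps,
    shape (batch,). Both terms push each correct step above its paired
    incorrect step, directly penalizing overcredit (a high score on the
    incorrect member of the pair).
    """
    diff = pos_scores - neg_scores
    bt = -F.logsigmoid(diff)          # Bradley-Terry-style ranking term
    hinge = F.relu(margin - diff)     # hard margin between the pair
    return (bt + hinge).mean()

# Usage with scores from any step-level reward head
pos = torch.tensor([2.1, 0.3, 1.5])
neg = torch.tensor([1.9, 0.8, -0.2])
loss = pairwise_step_loss(pos, neg)
```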
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18670