Keywords: Reward Hacking, Reasoning LLMs, Process Reward Model, False Positive Bias
Abstract: Process reward models (PRMs) supply step-level supervision for reasoning LLMs but often \textit{overcredit} incorrect steps, producing frequent false positives that mislead decoding and accumulate across long chains. We show analytically that false positives impose an asymptotic ceiling on Best-of-N alignment, whereas false negatives mainly slow convergence. To mitigate this, we introduce a label-efficient recipe: convert existing step annotations into positive–negative pairs, train with a novel Overcredit Contrastive (OC) loss, and rebalance the training data with lightweight negative augmentation and a simple curriculum. On PRMBench, our approach substantially lowers false positives and improves macro F1 over strong discriminative and generative PRMs. When used for guided beam search and Best-of-N selection, the resulting PRMs deliver higher downstream accuracy and robustness. Our results indicate that comparison-centered training on balanced step data is a practical path to trustworthy process supervision without new human labels.
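Below is a minimal sketch of the comparison-centered step objective the abstract describes. The exact Overcredit Contrastive (OC) loss is not specified here, so a standard margin-ranking form over positive–negative step pairs stands in for it; the `prm` model, the `pairwise_step_loss` helper, and the margin value are illustrative assumptions, not the authors' implementation.

```python
# Sketch: pairwise contrastive objective over positive/negative step pairs,
# assuming a PRM that emits one scalar logit per reasoning step.
import torch
import torch.nn.functional as F

def pairwise_step_loss(pos_scores: torch.Tensor,
                       neg_scores: torch.Tensor,
                       margin: float = 1.0) -> torch.Tensor:
    """Push correct-step scores above incorrect-step scores by a margin.

    pos_scores, neg_scores: shape (batch,), scalar PRM logits for paired
    correct / incorrect steps sharing the same reasoning prefix.
    """
    # Margin-ranking form: loss vanishes once pos > neg + margin, so the
    # remaining penalty concentrates on overcredited incorrect steps
    # (i.e., the false-positive cases the abstract targets).
    return F.relu(margin - (pos_scores - neg_scores)).mean()

# Hypothetical usage with a PRM scoring paired step inputs:
# pos = prm(pos_step_inputs).squeeze(-1)
# neg = prm(neg_step_inputs).squeeze(-1)
# loss = pairwise_step_loss(pos, neg)
```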
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18670