Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

Published: 18 Sept 2025 · Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: LLM, reasoning, process reward model
Abstract: Process reward models (PRMs) have proven effective for test-time scaling of LLMs on challenging reasoning tasks. However, the reward hacking they induce hinders their successful application in reinforcement fine-tuning. We find that the primary cause of PRM-induced reward hacking is the canonical summation-form credit assignment in reinforcement learning (RL), i.e., cumulative gamma-decayed future rewards, which drives the LLM to hack steps with high rewards. To unleash the power of PRMs at training time, we therefore propose PURE: Process sUpervised Reinforcement lEarning. The core of PURE is min-form credit assignment, which defines the value function as the minimum of future rewards. This method unifies the optimization objective with respect to process rewards across test time and training time, and significantly alleviates reward hacking by bounding the range of the value function and assigning advantages more reasonably. Through extensive experiments on 3 base models, we show that the PRM-based approach achieves reasoning performance comparable to the verifiable-reward-based approach once min-form credit assignment is enabled; in contrast, the canonical sum-form credit assignment collapses training from the very beginning. Moreover, when we supplement PRM-based fine-tuning with verifiable rewards on just 1/10th of the data, reward hacking is further alleviated and we obtain the best fine-tuned model based on Qwen2.5-Math-7B, with 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Finally, we summarize the reward hacking cases encountered during training and analyze the causes of training collapse.
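The contrast at the heart of the abstract, cumulative gamma-decayed returns versus a minimum over future process rewards, can be illustrated with a minimal sketch. This is not the authors' implementation; the function names and toy reward values are illustrative only:

```python
import numpy as np

def sum_form_returns(rewards, gamma=1.0):
    """Canonical summation-form credit assignment: the return at step t
    is the cumulative gamma-decayed sum of future process rewards."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def min_form_returns(rewards):
    """Min-form credit assignment: the value at step t is the minimum
    over current and future process rewards, so a single low-reward
    step caps the value of every step that precedes it."""
    returns = np.zeros(len(rewards))
    running = float("inf")
    for t in reversed(range(len(rewards))):
        running = min(running, rewards[t])
        returns[t] = running
    return returns

# Toy trajectory: three high-reward steps followed by one flawed step.
# Sum-form still assigns large credit to the early steps (inviting the
# model to farm high-reward steps), while min-form is bounded by the
# worst future step.
step_rewards = np.array([0.9, 0.9, 0.9, 0.1])
print(sum_form_returns(step_rewards, gamma=1.0))  # [2.8 1.9 1.0 0.1]
print(min_form_returns(step_rewards))             # [0.1 0.1 0.1 0.1]
```

Under this sketch's assumptions, the min-form value is always bounded by the reward range itself, which matches the abstract's claim that limiting the value function's range discourages hacking individual high-reward steps.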
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 2569