PRPO: Aligning Process Reward with Outcome Reward in Policy Optimization

ACL ARR 2026 January Submission3054 Authors

04 Jan 2026 (modified: 20 Mar 2026), ACL ARR 2026 January Submission, CC BY 4.0

Keywords: Large Language Model; Reinforcement Learning; Process Reward Model; RLVR

Abstract: Policy optimization for large language models often suffers from sparse reward signals in multi-step reasoning tasks. Critic-free methods like GRPO assign a single normalized outcome reward to all tokens, providing limited guidance for intermediate reasoning. While Process Reward Models (PRMs) offer dense feedback, they risk premature collapse when used alone, as early low-reward tokens can drive policies toward truncated outputs. We introduce Process Relative Policy Optimization (PRPO), which combines outcome reliability with process-level guidance in a critic-free framework. PRPO segments reasoning sequences based on semantic clues, normalizes PRM scores into token-level advantages, and aligns their distribution with outcome advantages through a location-parameter shift. On MATH500, PRPO improves Qwen2.5-Math-1.5B accuracy over GRPO from 61.2% to 64.4% using only eight rollouts and no value network, demonstrating efficient fine-grained credit assignment within critic-free optimization.
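
The abstract names three steps (segmentation, normalization of PRM scores, and a location-parameter shift toward the outcome advantage) without giving formulas. The following is a minimal sketch of how such a pipeline could look, not the authors' implementation. It assumes that segmentation has already produced per-segment PRM scores and token counts, that normalization is a z-score over the segments of one rollout, and that the shift simply adds the rollout's GRPO-style outcome advantage so that the segment-level mean matches it. All function and parameter names are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values, eps=1e-8):
    """Standardize values to zero mean and roughly unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def outcome_advantages(outcome_rewards):
    """GRPO-style advantage: one outcome reward per rollout, normalized across the group."""
    return zscore(outcome_rewards)


def process_advantages(prm_scores, segment_lengths, outcome_adv):
    """Token-level advantages for a single rollout (illustrative assumption).

    prm_scores:      one PRM score per reasoning segment
    segment_lengths: number of tokens in each segment
    outcome_adv:     this rollout's normalized outcome advantage
    """
    if len(prm_scores) != len(segment_lengths):
        raise ValueError("need one PRM score per segment")
    # Normalize PRM scores within the rollout; the result has zero mean.
    normalized = zscore(prm_scores)
    # Location shift: move the distribution so its mean equals the outcome advantage.
    shifted = [z + outcome_adv for z in normalized]
    # Broadcast each segment's advantage to every token in that segment.
    token_adv = []
    for adv, n_tokens in zip(shifted, segment_lengths):
        token_adv.extend([adv] * n_tokens)
    return token_adv


# Hypothetical group of four rollouts with binary outcome rewards.
group_adv = outcome_advantages([1.0, 0.0, 0.0, 1.0])
# First rollout: three segments with hypothetical PRM scores and token counts.
tokens = process_advantages([0.9, 0.5, 0.1], [2, 1, 3], group_adv[0])
```

Under these assumptions, a correct rollout keeps a positive average advantage while its weaker segments receive less credit than its stronger ones, which is one way to read the abstract's claim that the shift guards against collapse driven by early low-reward tokens.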

Paper Type: Long

Research Area: Machine Learning for NLP

Research Area Keywords: Language Modeling; Machine Learning for NLP

Contribution Types: NLP engineering experiment, Theory

Languages Studied: English

Submission Number: 3054
