Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients

TMLR Paper 6675 Authors

27 Nov 2025 (modified: 01 Dec 2025) · Under review for TMLR · CC BY 4.0
Abstract: We unify two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards (RLVR): direct REINFORCE-style methods and advantage-shaping techniques that modify GRPO. By reverse-engineering existing advantage-shaping algorithms, we show that they implicitly optimize surrogate rewards. In particular, we interpret practical "hard-example upweighting" modifications to GRPO as reward-level regularization. Conversely, starting from surrogate reward objectives, we provide a simple recipe for deriving both existing and new advantage-shaping methods. This perspective offers a lens on RLVR that extends beyond our original motivation of Pass@K.
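To make the quantities in the abstract concrete, the sketch below shows the standard unbiased Pass@K estimator and GRPO's group-normalized advantages, plus an illustrative "hard-example upweighting" that scales a prompt's advantages by its group failure rate. The functions `shaped_advantages` and its weighting rule are hypothetical choices for illustration only, not the shaping rule or surrogate reward derived in the paper.

```python
import numpy as np
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n sampled responses with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standard GRPO: group-normalized advantages for one prompt's G responses."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def shaped_advantages(rewards: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Illustrative advantage shaping (hypothetical, not the paper's rule):
    upweight prompts whose group success rate is low ("hard examples")."""
    p_success = rewards.mean()            # fraction of correct responses in the group
    weight = (1.0 - p_success) ** alpha   # larger weight for harder prompts
    return weight * grpo_advantages(rewards)


# Example: a group of G = 8 binary verifiable rewards for one prompt.
rewards = np.array([1, 0, 0, 0, 1, 0, 0, 0], dtype=float)
print(pass_at_k(n=8, c=2, k=4))      # estimated Pass@4 for this prompt
print(shaped_advantages(rewards))    # shaped per-response advantages
```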
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Emmanuel_Bengio1
Submission Number: 6675