Advantage Shaping as Surrogate Reward Maximization: Unifying Pass@K Policy Gradients

TMLR Paper 6675 Authors

27 Nov 2025 (modified: 01 Dec 2025) · Under review for TMLR · CC BY 4.0
Abstract: We unify two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards (RLVR): direct REINFORCE-style methods and advantage-shaping techniques that modify GRPO. By reverse-engineering existing advantage-shaping algorithms, we show that they implicitly optimize surrogate rewards. In particular, we interpret practical "hard-example upweighting" modifications to GRPO as reward-level regularization. Conversely, starting from surrogate reward objectives, we provide a simple recipe for deriving both existing and new advantage-shaping methods. This perspective offers a lens on RLVR that extends beyond our original motivation of Pass@K.
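To make the quantities in the abstract concrete, the sketch below shows the standard unbiased Pass@K estimator and GRPO's group-normalized advantages, plus an illustrative "hard-example upweighting" that scales a prompt's advantages by its group failure rate. The functions `shaped_advantages` and its weighting rule are hypothetical choices for illustration only, not the shaping rule or surrogate reward derived in the paper.

```python
import numpy as np
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@K estimate from n sampled responses with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standard GRPO: group-normalized advantages for one prompt's G responses."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def shaped_advantages(rewards: np.ndarray, alpha: float = 1.0) -> np.ndarray:
    """Illustrative advantage shaping (hypothetical, not the paper's rule):
    upweight prompts whose group success rate is low ("hard examples")."""
    p_success = rewards.mean()            # fraction of correct responses in the group
    weight = (1.0 - p_success) ** alpha   # larger weight for harder prompts
    return weight * grpo_advantages(rewards)


# Example: a group of G = 8 binary verifiable rewards for one prompt.
rewards = np.array([1, 0, 0, 0, 1, 0, 0, 0], dtype=float)
print(pass_at_k(n=8, c=2, k=4))      # estimated Pass@4 for this prompt
print(shaped_advantages(rewards))    # shaped per-response advantages
```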
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Emmanuel_Bengio1
Submission Number: 6675