Abstract: We unify two seemingly distinct approaches to policy gradient optimization for the Pass@K objective in reinforcement learning with verifiable rewards (RLVR): direct REINFORCE-style methods and advantage-shaping techniques that modify GRPO. By reverse-engineering existing advantage-shaping algorithms, we show that they implicitly optimize surrogate rewards. In particular, we interpret practical ``hard-example upweighting'' modifications to GRPO as reward-level regularization. Conversely, starting from surrogate reward objectives, we provide a simple recipe for deriving both existing and new advantage-shaping methods.
This perspective offers a lens on RLVR more broadly, beyond our original motivation of Pass@K.
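As a rough illustration of the reward-level view described above (a sketch under stated assumptions, not the paper's actual algorithms), the snippet below contrasts standard GRPO group-normalized advantages with (i) the standard unbiased Pass@K estimate computed from a group of binary rewards and (ii) a hypothetical hard-example-upweighting shaping. The shaping factor `(1 - p) ** alpha` and the function names are illustrative assumptions, not definitions taken from the submission.

```python
import numpy as np
from math import comb


def grpo_advantages(rewards):
    """Standard GRPO: z-score the binary rewards within a group of rollouts."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)


def pass_at_k_estimate(rewards, k):
    """Unbiased Pass@K estimate for a group of n binary rewards with c successes:
    1 - C(n - c, k) / C(n, k)."""
    n = len(rewards)
    c = int(sum(rewards))
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def hard_example_shaped_advantages(rewards, alpha=1.0):
    """Hypothetical 'hard-example upweighting': scale the group's GRPO advantages
    by a factor that grows as the group's success rate p falls, emphasizing
    prompts the policy rarely solves (illustrative assumption only)."""
    r = np.asarray(rewards, dtype=float)
    p = r.mean()
    weight = (1.0 - p) ** alpha  # larger weight for harder prompts
    return weight * grpo_advantages(r)


if __name__ == "__main__":
    group = [1, 0, 0, 0, 0, 0, 0, 0]  # one success out of eight rollouts
    print("GRPO advantages:        ", grpo_advantages(group))
    print("Pass@4 estimate:        ", pass_at_k_estimate(group, k=4))
    print("Shaped advantages:      ", hard_example_shaped_advantages(group))
```

A usage note: with one success in eight rollouts, the Pass@4 estimate is 0.5, while the shaping factor upweights the whole group's advantages relative to an easier prompt where most rollouts succeed.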
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Emmanuel_Bengio1
Submission Number: 6675