Preventing Process Reward Model Hacking When Training Large Language Models on Verifiable Rewards
Keywords: Large Language Models
TL;DR: Process Reward Models can be hacked; we draw on the optimality-preserving reward shaping literature to prevent this.
Abstract: Alignment of Large Language Models (LLMs) to human preferences is an active and important field of study. Recent work in Reinforcement Learning with Verifiable Rewards (RLVR) aims to bypass the need for costly and imprecise human preference reward data by training LLMs in domains where a simple, known solution exists. Because the RLVR signal is sparse, however, it is often supplemented with a Process Reward Model (PRM), which provides a dense reward for each token, or each step of the chain of thought. Using the VersaPRM extension to the MMLU-Pro dataset, we demonstrate that PRMs are susceptible to reward hacking, wherein the model is incentivized to produce long, plausible-seeming chains of thought that do not lead to the correct response. Building on recent work in potential-based and optimality-preserving reward shaping, we develop a theoretical framework and a suite of methods that prevent this reward hacking while still utilizing PRMs effectively. We prove theoretically, and demonstrate empirically, that these methods prevent the PRM from altering the optimal policy, and thus from being optimized at the expense of the RLVR signal.
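As a rough illustration of the potential-based shaping idea the abstract refers to, the sketch below treats the PRM score of each partial chain of thought as a potential function Phi and adds the classical shaping term gamma*Phi(s') - Phi(s) (Ng, Harada & Russell, 1999) on top of a terminal verifiable reward. This is not the authors' code; the helper name `shaped_step_rewards` and the exact reward layout are assumptions made for the example. Because the shaping terms telescope over a trajectory, the PRM bonus cannot change which policy is optimal under the verifiable reward alone.

```python
# Hedged sketch (not the paper's implementation): potential-based shaping of
# Process Reward Model (PRM) scores on top of a sparse verifiable (RLVR) reward.
from typing import List


def shaped_step_rewards(
    prm_scores: List[float],   # Phi(s_0), ..., Phi(s_T): PRM score after each reasoning step
    verifiable_reward: float,  # sparse RLVR reward, e.g. 1.0 if the final answer verifies
    gamma: float = 1.0,
) -> List[float]:
    """Per-step rewards: potential-based PRM bonus plus terminal verifiable reward."""
    if len(prm_scores) < 2:
        # Degenerate trajectory: no shaping terms, only the terminal reward.
        return [verifiable_reward]
    rewards = []
    for t in range(len(prm_scores) - 1):
        # F(s_t, s_{t+1}) = gamma * Phi(s_{t+1}) - Phi(s_t); these terms telescope,
        # so the total shaped return differs from the RLVR return only by a
        # policy-independent constant, leaving the optimal policy unchanged.
        rewards.append(gamma * prm_scores[t + 1] - prm_scores[t])
    # The verifiable reward is granted only at the end of the trajectory.
    rewards[-1] += verifiable_reward
    return rewards


if __name__ == "__main__":
    # Example: a 4-step chain of thought with rising PRM scores and a correct final answer.
    print(shaped_step_rewards([0.1, 0.4, 0.6, 0.9], verifiable_reward=1.0))
```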
Area: Generative and Agentic AI (GAAI)
Generative AI: I acknowledge that I have read and will follow this policy.
Submission Number: 1561