Keywords: Reinforcement Learning From AI Feedback, RLHF, Reward Hacking
TL;DR: The first end-to-end pipeline deploying trained GenRMs as rewards for online policy optimization, investigating scaling behavior across model sizes, training budgets, and chain-of-thought reasoning
Abstract: We study the scaling behavior of generative reward models (GenRMs) for reinforcement learning from AI feedback (RLAIF) when used as drop-in replacements for Bradley-Terry models to optimize policies. Building on established scaling laws for reward model overoptimization, we investigate whether GenRMs, particularly those employing chain-of-thought reasoning, exhibit different robustness properties as policies drift from their training distribution during gradient updates. Using the Qwen3 model family (0.6B--14B), we systematically evaluate thinking GenRMs (trained via GRPO) against answer-only variants (trained via SFT) across policy size, reward model size, reward model type, training budget, and the $\beta$ parameter in online DPO. Our results reveal a consistent evaluator-rewarder gap: thinking GenRMs outperform answer-only variants by 1--2\% on validation tasks, yet these gains diminish---and often reverse---during policy optimization, where answer-only GenRMs achieve higher Gold Elo and more stable proxy--Gold alignment. We find that reward model scale is the most decisive factor for policy quality, with gains continuing even when the GenRM far exceeds the policy in parameters. Moreover, intermediate GRPO checkpoints of thinking judges can outperform fully-trained checkpoints as rewarders, despite worse static accuracy. We track these dynamics with Elo arenas under both proxy and Gold evaluation, providing a fine-grained proxy--Gold alignment diagnostic beyond saturated validation metrics.
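For context on the $\beta$ parameter referenced in the abstract, the standard online DPO objective (a minimal sketch of the usual formulation; the submission's exact variant may differ) is
\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{x \sim \mathcal{D},\,(y_w, y_l) \sim \pi_\theta(\cdot \mid x)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right],
\]
where $(y_w, y_l)$ are the completions preferred and dispreferred by the reward model (here, the GenRM) among samples drawn from the current policy $\pi_\theta$, $\pi_{\mathrm{ref}}$ is the reference policy, $\sigma$ is the logistic function, and $\beta$ scales the implicit KL-style penalty keeping the policy close to the reference.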
Primary Area: foundation or frontier models, including LLMs
Submission Number: 25601