Scaling Laws for Generative Reward Models

ICLR 2026 Conference Submission 25601 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Reinforcement Learning From AI Feedback, RLHF, Reward Hacking
TL;DR: First end-to-end pipeline deploying trained GenRMs for online policy optimization, investigating scaling laws across model sizes, training budgets, and chain-of-thought reasoning
Abstract: We study the scaling behavior of generative reward models (GenRMs) for reinforcement learning from AI feedback (RLAIF) when used as drop-in replacements for Bradley-Terry models to optimize policies. Building on established scaling laws for reward model overoptimization, we investigate whether GenRMs, particularly those employing chain-of-thought reasoning, exhibit different robustness properties as policies drift from their training distribution during gradient updates. Using the Qwen3 model family (0.6B–14B), we systematically evaluate thinking GenRMs (trained via GRPO) against answer-only variants (trained via SFT) across policy size, reward model size, reward model type, training budget, and the β parameter in online DPO. Our results show that the most decisive determinants of policy quality are reward model size and training duration, followed by policy model scale and GenRM type. While thinking variants trained with GRPO consistently outperform answer-only models on validation tasks, these gains largely diminish when the models are deployed for downstream policy optimization, where classifier-based reward models can match or exceed GenRM performance despite the latter's substantial computational overhead. To measure alignment beyond saturated validation metrics, we employ Elo-based rankings, which provide finer-grained proxy-gold alignment measurements than the simple win rates against reference policies used in prior work.
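For context on two quantities named in the abstract, the following is a minimal sketch of the standard formulations, not the authors' specific training setup: the online DPO loss shows where the β parameter enters (it scales the implicit reward / regularization toward the reference policy), and the Elo expected-score formula underlies the ranking-based evaluation. Symbols such as π_θ, π_ref, and the pair (y_w, y_l) are the usual DPO notation and are assumptions here, not taken from the submission.

```latex
% Standard DPO objective (generic form, not paper-specific):
% \pi_\theta is the policy being optimized, \pi_{\mathrm{ref}} the reference policy,
% (y_w, y_l) the preferred/dispreferred completions for prompt x (here labeled online
% by the reward model), and \beta scales the implicit reward relative to \pi_{\mathrm{ref}}.
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]

% Standard Elo expected score of policy A against policy B, given ratings R_A, R_B;
% pairwise win/loss outcomes update ratings toward these expectations.
E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}
```

A larger β keeps the policy closer to the reference model during optimization, which is why it is a natural axis to vary when studying how quickly policies drift from the reward model's training distribution.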
Primary Area: foundation or frontier models, including LLMs
Submission Number: 25601