Reinforcement Reward Model with Policy Feedback

05 Sept 2025 (modified: 08 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Reinforcement Learning, RLHF, Reward Model
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for aligning large language models (LLMs) with human preferences, yet it is susceptible to reward hacking, a phenomenon in which policy models exploit spurious reward patterns instead of faithfully capturing human intent. Prior work on mitigating reward hacking primarily relies on surface-level semantic information and fails to efficiently address the misalignment between the reward model and the policy model caused by continuous policy distribution shifts. This inevitably leads to a growing reward discrepancy, exacerbating reward hacking. To address these limitations, we propose R2M (Reinforcement Reward Model), a novel lightweight RLHF framework. Specifically, we aim to go beyond vanilla reward models that depend solely on the semantic representations of a pretrained LLM. Instead, we enhance the reward model by incorporating the evolving hidden states of the policy (termed policy feedback). We redesign the scoring head of the reward model to integrate policy feedback and introduce a corresponding iterative lightweight training phase, using real-time policy feedback to enable adaptation to policy distribution shifts. Notably, without modifying the core RLHF algorithms, simply integrating R2M enables the reward model to achieve iterative distribution alignment with accurate reward allocation, yielding a 4.8\% to 5.6\% win-rate improvement on dialogue tasks and a 6.3\% win-rate improvement on document summarization tasks, while introducing marginal computational cost. This work points to a promising new direction for improving reward models through real-time use of feedback from policy models.
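The abstract describes a redesigned scoring head that fuses the reward model's own representation with the policy's evolving hidden states. The sketch below illustrates one plausible form such a head could take; the module name `PolicyFeedbackRewardHead`, the concatenation-based fusion, and all dimensions are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal, illustrative sketch (assumed, not the authors' code) of a reward
# scoring head that conditions on policy hidden states ("policy feedback").
import torch
import torch.nn as nn


class PolicyFeedbackRewardHead(nn.Module):
    """Scores a response by fusing the reward model's representation with a
    hidden state taken from the current (evolving) policy model."""

    def __init__(self, rm_hidden_dim: int, policy_hidden_dim: int, fused_dim: int = 512):
        super().__init__()
        # Fuse the two representations into a shared space, then score.
        self.fuse = nn.Sequential(
            nn.Linear(rm_hidden_dim + policy_hidden_dim, fused_dim),
            nn.Tanh(),
        )
        self.score = nn.Linear(fused_dim, 1)

    def forward(self, rm_hidden: torch.Tensor, policy_hidden: torch.Tensor) -> torch.Tensor:
        # rm_hidden:     (batch, rm_hidden_dim)     last-token state of the reward model
        # policy_hidden: (batch, policy_hidden_dim) last-token state of the current policy
        fused = self.fuse(torch.cat([rm_hidden, policy_hidden], dim=-1))
        return self.score(fused).squeeze(-1)  # one scalar reward per sequence
```

In such a design, only this lightweight head would be periodically refreshed with up-to-date policy hidden states during RLHF, which is consistent with the abstract's claim that the core RLHF algorithm is left unchanged and the added computational cost is marginal.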
Supplementary Material: zip
Primary Area: reinforcement learning
Submission Number: 2312