Probabilistic Uncertain Reward Model

16 Sept 2025 (modified: 11 Feb 2026)Submitted to ICLR 2026EveryoneRevisionsBibTeXCC BY 4.0
Keywords: reward model, reward hacking, RLHF, uncertainty
TL;DR: we propose a Probabilistic Uncertain Reward Model (PURM), a natural generalization of the classical Bradley-Terry reward model, that can directly learn the reward distribution emerged from the preference data.
Abstract: Reinforcement learning from human feedback (RLHF) is a critical technique for training large language models. However, conventional reward models based on the Bradley-Terry model (BTRM) often suffer from overconfidence when faced with inconsistent labels or out-of-distribution samples, leading to reward hacking, where the policy model blindly optimizes for proxy rewards while degrading true performance. This paper proposes the Probabilistic Uncertain Reward Model (PURM), which generalizes the Bradley-Terry model to learn the reward distributions that emerged from the preference data. We theoretically derive the loss function of PURM and introduce a novel method that uses the overlap between distributions to define and derive the quantify uncertainty. Empirical results show that PURM outperforms existing methods with more accurate reward and sound uncertainty estimations, and sustains effective learning for more optimization steps and obtain higher maximum win rate in RLHF. The data and code of this paper are released at https://anonymous.4open.science/r/Probabilistic-Uncertain-Reward-Model/
Primary Area: foundation or frontier models, including LLMs
Submission Number: 6700
Loading