Learning to Reward: A Contextual Bandit Framework for Distributional Reward Policy Optimization

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLM, Reward model, Preference learning, RL
Abstract: Reward models (RMs) are a cornerstone of aligning large language models (LLMs) with human preferences, yet their ability to faithfully represent the nuance and uncertainty of these preferences remains a critical challenge. Existing RMs can be classified by their output (point-estimate vs. distributional) and training paradigm (supervised vs. reinforcement learning). While prior work has addressed three of these four quadrants, no existing framework combines the dynamic optimization of reinforcement learning with the rich, uncertainty-aware representation of a distributional model. To fill this gap, we propose \textbf{D}istributional \textbf{R}eward \textbf{P}olicy \textbf{O}ptimization (\textbf{DRPO}). DRPO formulates the learning of a distributional reward model as a contextual bandit problem in which the RM itself acts as a stochastic policy. This policy is trained using an uncertainty-aware meta-reward signal derived directly from the statistics of the reward distributions. We provide an analysis of DRPO's gradient dynamics and conduct extensive experiments demonstrating its effectiveness. Our code and data are available at \url{https://anonymous.4open.science/r/DRPO/}.
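The abstract describes the mechanism only at a high level. A minimal sketch of the training loop it suggests might look as follows, assuming a Gaussian reward head and a REINFORCE-style update; all names here (`DistributionalRM`, `meta_reward`, the specific uncertainty weighting) are illustrative assumptions, not the authors' actual implementation, which lives at the linked repository.

```python
# Hypothetical sketch of the DRPO loop sketched in the abstract:
# the RM outputs a reward *distribution*, acts as a stochastic bandit
# policy by sampling rewards, and is updated with a meta-reward derived
# from the distribution's statistics. Illustrative only.
import torch
import torch.nn as nn

class DistributionalRM(nn.Module):
    """Maps a (prompt, response) embedding to a Gaussian reward distribution."""
    def __init__(self, dim: int):
        super().__init__()
        self.mu_head = nn.Linear(dim, 1)
        self.log_sigma_head = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> torch.distributions.Normal:
        mu = self.mu_head(h).squeeze(-1)
        sigma = self.log_sigma_head(h).squeeze(-1).exp()
        return torch.distributions.Normal(mu, sigma)

def meta_reward(r_chosen, r_rejected, dist_c, dist_r):
    # Illustrative uncertainty-aware meta-reward: positive when the sampled
    # rewards rank the human-preferred response first, damped when the
    # predictive standard deviations are large.
    correct = (r_chosen > r_rejected).float()
    uncertainty = 0.5 * (dist_c.stddev + dist_r.stddev)
    return (2.0 * correct - 1.0) / (1.0 + uncertainty)

rm = DistributionalRM(dim=16)
opt = torch.optim.Adam(rm.parameters(), lr=1e-3)

# Toy batch: stand-in embeddings for chosen / rejected responses.
h_c, h_r = torch.randn(32, 16), torch.randn(32, 16)

dist_c, dist_r = rm(h_c), rm(h_r)
r_c, r_r = dist_c.sample(), dist_r.sample()         # bandit "actions"
R = meta_reward(r_c, r_r, dist_c, dist_r)           # bandit feedback
logp = dist_c.log_prob(r_c) + dist_r.log_prob(r_r)
loss = -(R.detach() * logp).mean()                  # REINFORCE-style update
opt.zero_grad(); loss.backward(); opt.step()
```

The contextual-bandit framing enters through the sampling step: each sampled reward is a one-shot action, and the meta-reward plays the role of bandit feedback, so no credit assignment over trajectories is needed.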
Primary Area: foundation or frontier models, including LLMs
Submission Number: 9353