Uncertainty-aware Preference Alignment in Reinforcement Learning from Human Feedback

Published: 17 Jun 2024, Last Modified: 02 Jul 2024 · ICML 2024 Workshop MHFAIA Poster · CC BY 4.0
Keywords: Reinforcement Learning from Human Feedback, Distributional Reinforcement Learning, Risk-Sensitive Reinforcement Learning
TL;DR: We introduce an uncertainty-aware preference alignment framework in RLHF by learning a distributional reward model and a risk-sensitive policy from the offline preference dataset.
Abstract: Recent advances in Reinforcement Learning from Human Feedback (RLHF) typically model a reward function by maximizing its likelihood of generating observed human preferences. However, due to the diverse backgrounds of individuals, these preference signals are inherently stochastic. This uncertainty can lead to unstable or unsafe behaviors during reward and policy updates. In this work, we introduce an uncertainty-aware preference alignment framework for RLHF by learning a distributional reward model and a risk-sensitive policy from the offline preference dataset. Specifically, we propose a Maximum A Posteriori (MAP) objective for updating the reward associated with a trajectory. This updating process incorporates an informative prior to account for the uncertainty in human preferences. Using these updated reward samples, we develop a generative reward model to represent the reward distribution. Given the stochasticity captured by the reward model, we apply the offline distributional Bellman operator and the Conditional Value-at-Risk (CVaR) metric to learn a risk-sensitive policy from the offline dataset. Experimental results show that the risk-sensitive RLHF agent can effectively identify and avoid states with significant stochasticity, thereby enabling risk-averse control in different tasks.
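To make the two main ingredients of the abstract concrete, below is a minimal sketch of (i) a MAP-style objective for trajectory rewards under a Bradley-Terry preference likelihood with a Gaussian prior, and (ii) a sample-based CVaR estimate used for risk-averse action selection. The function names, the Gaussian prior, and the empirical CVaR estimator are illustrative assumptions for this sketch; the paper's exact prior, parameterization, and offline distributional Bellman update are not specified on this page.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def neg_log_posterior(r, prefs, prior_mean, prior_std):
    """Bradley-Terry negative log-likelihood plus a Gaussian prior term (MAP objective).

    r          : array of scalar rewards, one per trajectory (the quantity being updated)
    prefs      : list of (winner_idx, loser_idx) pairs from the offline preference dataset
    prior_mean : prior belief about each trajectory's reward (illustrative assumption)
    prior_std  : prior scale; a smaller value pulls the MAP estimate toward the prior
    """
    nll = -sum(np.log(sigmoid(r[w] - r[l]) + 1e-12) for w, l in prefs)
    prior_penalty = np.sum((r - prior_mean) ** 2) / (2.0 * prior_std ** 2)
    return nll + prior_penalty


def cvar(samples, alpha=0.1):
    """Empirical CVaR: mean of the worst alpha-fraction of sampled returns (lower tail)."""
    s = np.sort(np.asarray(samples))
    k = max(1, int(np.ceil(alpha * len(s))))
    return float(s[:k].mean())


# Hypothetical usage: a risk-averse agent prefers the action whose sampled
# return distribution has the higher CVaR, even if its mean is lower.
rng = np.random.default_rng(0)
returns_per_action = {
    "a1": rng.normal(loc=1.0, scale=0.2, size=1000),  # modest mean, low variance
    "a2": rng.normal(loc=1.2, scale=2.0, size=1000),  # higher mean, heavy left tail
}
best_action = max(returns_per_action, key=lambda a: cvar(returns_per_action[a], alpha=0.1))
```

Minimizing `neg_log_posterior` corresponds to maximizing the posterior over trajectory rewards, and the CVaR criterion is what turns the learned reward distribution into risk-averse behavior; a distributional critic would be trained on these reward samples before applying the CVaR objective.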
Submission Number: 35