Mitigating Reward Overoptimization via Lightweight Uncertainty Estimation

Published: 25 Sept 2024, Last Modified: 06 Nov 2024
Venue: NeurIPS 2024 poster
License: CC BY 4.0
Keywords: Overoptimization in RLHF, Lightweight Uncertainty Estimation, Adversarial Policy Optimization
TL;DR: We present a novel solution to the prevalent problem of reward overoptimization in RLHF through adversarial policy optimization with lightweight uncertainty estimation.
Abstract: Reinforcement Learning from Human Feedback (RLHF) has been pivotal in aligning Large Language Models with human values but often suffers from overoptimization due to its reliance on a proxy reward model. To mitigate this limitation, we first propose a lightweight uncertainty quantification method that assesses the reliability of the proxy reward using only the last layer embeddings of the reward model. Enabled by this efficient uncertainty quantification method, we formulate AdvPO, a distributionally robust optimization procedure to tackle the reward overoptimization problem in RLHF. Through extensive experiments on the Anthropic HH and TL;DR summarization datasets, we verify the effectiveness of AdvPO in mitigating the overoptimization problem, resulting in enhanced RLHF performance as evaluated through human-assisted evaluation.
Primary Area: Natural language processing
Submission Number: 7983
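
The abstract says the proxy reward's reliability is assessed using only the last layer embeddings of the reward model, but it does not give the formula. As a rough illustration only, the sketch below assumes a ridge-regularized elliptical norm over those embeddings (a standard linear-bandit style estimate) and applies it as a simple per-sample pessimistic penalty. This penalty is a stand-in and is not AdvPO's distributionally robust procedure. The function names and the `ridge` and `beta` parameters are hypothetical and do not come from the paper.

```python
import numpy as np

def fit_precision(train_embeddings, ridge=1.0):
    """Invert the ridge-regularized Gram matrix of last-layer embeddings.

    train_embeddings: array of shape (n, d), one row per reward-model
    training example (assumed to be extracted beforehand).
    """
    d = train_embeddings.shape[1]
    gram = ridge * np.eye(d) + train_embeddings.T @ train_embeddings
    return np.linalg.inv(gram)

def uncertainty(embedding, precision):
    """Elliptical norm sqrt(phi^T P phi); accepts shape (d,) or (n, d)."""
    quad = np.einsum("...i,ij,...j->...", embedding, precision, embedding)
    return np.sqrt(quad)

def pessimistic_reward(reward, embedding, precision, beta=1.0):
    """Assumed stand-in: subtract a scaled uncertainty from the proxy reward."""
    return reward - beta * uncertainty(embedding, precision)

# Toy data: training embeddings lie only along the first axis, so a query
# along the second axis is far from anything the reward model has seen.
train = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]])
precision = fit_precision(train, ridge=1.0)
seen_direction = np.array([1.0, 0.0])
unseen_direction = np.array([0.0, 1.0])
```

In this toy setup the query along the unseen direction receives the larger uncertainty, so its proxy reward is discounted more heavily. The only extra cost over a reward-model forward pass is one d-by-d matrix inverse, which is the sense in which such an estimate is lightweight.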