Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We study reward transfer in online RLHF. We propose a theoretical transfer learning algorithm with provable benefits, then develop an empirical version with improved scalability and evaluate it experimentally.
Abstract: Sample efficiency is critical for online Reinforcement Learning from Human Feedback (RLHF). While existing works investigate sample-efficient online exploration strategies, the potential of utilizing misspecified yet relevant reward models to accelerate learning remains underexplored. This paper studies how to transfer knowledge from such imperfect reward models in online RLHF. We start by identifying a novel property arising from KL-regularization in the RLHF objective: *a policy's coverability of the optimal policy is captured by its sub-optimality*. Building on this insight, we propose novel transfer learning principles and a theoretical algorithm, **T**ransfer **P**olicy **O**ptimization (**TPO**), with provable benefits compared to standard online learning. Empirically, inspired by our theoretical findings, we develop a win-rate-based transfer policy selection strategy with improved computational efficiency. Moreover, our empirical transfer learning technique is modular and can be integrated with various policy optimization methods, such as DPO, IPO, and XPO, to further enhance their performance. We validate the effectiveness of our method through experiments on summarization tasks.
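For context, the coverability property mentioned in the abstract concerns the standard KL-regularized RLHF objective. The sketch below states that objective and the kind of density-ratio coverage quantity involved; the notation ($\rho$ for the prompt distribution, $r$ for the reward, $\pi_{\mathrm{ref}}$ for the reference policy, $\beta$ for the regularization strength, $\pi^*$ for the optimal regularized policy) is assumed here for illustration, and the paper's exact definitions and statements may differ.

```latex
% Standard KL-regularized RLHF objective (assumed notation; see the paper for the precise setup):
\[
  J_\beta(\pi) \;=\; \mathbb{E}_{x \sim \rho,\, y \sim \pi(\cdot \mid x)}\!\big[ r(x, y) \big]
  \;-\; \beta\, \mathbb{E}_{x \sim \rho}\!\Big[ \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big) \Big].
\]
% The abstract's key property relates a policy's sub-optimality gap J_\beta(\pi^*) - J_\beta(\pi)
% to how well \pi covers the optimal policy \pi^*, e.g. through a density-ratio coverage
% coefficient of the form
\[
  C_{\pi^*}(\pi) \;=\; \sup_{x,\, y} \frac{\pi^*(y \mid x)}{\pi(y \mid x)}.
\]
```

Informally, the property says that a policy with small sub-optimality under $J_\beta$ also covers $\pi^*$ well, which is the insight the proposed transfer learning principles build on.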
Lay Summary: Reinforcement Learning from Human Feedback (RLHF) is a key step in fine-tuning large language models (LLMs), but collecting human feedback is expensive. This makes improving sample efficiency, i.e., learning from fewer annotations, an essential goal. While most prior work focuses on better exploration or modeling techniques, we take a different approach: **can we speed up learning by transferring knowledge from any available reward models, even imperfect ones**? We introduce **Transfer Policy Optimization (TPO)**, an algorithm with novel transfer learning strategies and provable benefits. Inspired by our theoretical findings, we also propose an empirical version of TPO, a scalable algorithm template that can leverage even flawed reward models to reduce the need for human feedback. Our work highlights an underexplored direction in RLHF: extracting and using information from imperfect signals to improve learning efficiency. This opens new possibilities for faster, cheaper, and more flexible training of LLMs.
Link To Code: https://github.com/jiaweihhuang/RLHF_RewardTransfer
Primary Area: Theory->Reinforcement Learning and Planning
Keywords: Reinforcement Learning from Human Feedback, Reward transfer, sample-efficient reinforcement learning
Submission Number: 2154