Active Preference Optimization for Sample Efficient RLHF

Published: 18 Jun 2024, Last Modified: 11 Jul 2024 · TF2M 2024 Poster · CC BY 4.0
Keywords: Active Learning, Preference Bandits, RLHF, Sample Efficiency
TL;DR: An active learning approach to preference data collection in RLHF that provably improves model alignment under a constrained sample budget.
Abstract: Reinforcement Learning from Human Feedback (RLHF) is pivotal in aligning Large Language Models (LLMs) with human preferences. Although aligned LLMs have shown remarkable abilities in numerous tasks, their reliance on high-quality human preference data creates a costly bottleneck. Current methods for RLHF rely on uniformly sampling prompt-generation pairs from a dataset of prompt-generations when collecting human feedback. With a limited number of human feedback samples, we show that this leads to sub-optimal alignment. Next, we develop an active-learning algorithm, $\textit{Active Preference Optimization}$ ($\texttt{APO}$), which significantly enhances model alignment by querying preference data for the most important samples, thereby achieving superior performance under a small sample budget. We analyze the theoretical performance guarantees of $\texttt{APO}$, showing that the suboptimality gap of the policy learned via $\texttt{APO}$ scales as $O(1/\sqrt{T})$ for a sample budget of $T$. We perform detailed experimental evaluations on practical preference datasets to validate $\texttt{APO}$'s efficacy over existing methods, establishing it as a sample-efficient, cost-effective, and scalable solution for alignment.
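To make the idea of actively querying the "most important" samples concrete, here is a minimal, hypothetical sketch of uncertainty-based selection of preference queries under a linear Bradley-Terry reward model. This is not the paper's $\texttt{APO}$ algorithm; the acquisition rule, feature representation, and all variable names are illustrative assumptions.

```python
# Hypothetical sketch of active preference-data selection (NOT the paper's APO):
# instead of labeling uniformly sampled prompt-generation pairs, query the pairs
# the current reward model is most uncertain about, up to a budget of T labels.
import numpy as np

rng = np.random.default_rng(0)

d, pool_size, T = 8, 500, 50           # feature dim, candidate pool, label budget
theta_star = rng.normal(size=d)        # unknown "true" reward parameters (simulation only)

# Each candidate pair of generations for a prompt is summarized by the feature
# difference phi(x, y1) - phi(x, y2) of a linear reward model (an assumption).
feature_diff = rng.normal(size=(pool_size, d))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta_hat = np.zeros(d)                # current reward-model estimate
labeled_idx, labels = [], []

for t in range(T):
    # Acquisition: pick the pair whose predicted preference probability is
    # closest to 0.5 under the current model, i.e. the most uncertain pair.
    probs = sigmoid(feature_diff @ theta_hat)
    scores = -np.abs(probs - 0.5)
    scores[labeled_idx] = -np.inf      # never re-query an already-labeled pair
    i = int(np.argmax(scores))

    # Simulated human feedback drawn from a Bradley-Terry model with theta_star.
    y = rng.random() < sigmoid(feature_diff[i] @ theta_star)
    labeled_idx.append(i)
    labels.append(float(y))

    # Refit the logistic (Bradley-Terry) reward model with a few gradient steps.
    X = feature_diff[labeled_idx]
    z = np.array(labels)
    for _ in range(100):
        grad = X.T @ (sigmoid(X @ theta_hat) - z) / len(z) + 1e-3 * theta_hat
        theta_hat -= 0.5 * grad

cosine = theta_hat @ theta_star / (
    np.linalg.norm(theta_hat) * np.linalg.norm(theta_star) + 1e-12
)
print("cosine(theta_hat, theta_star) =", float(cosine))
```

The same loop with `i` drawn uniformly at random gives the baseline the abstract critiques; comparing the recovered reward direction under the two selection rules at the same budget `T` illustrates the sample-efficiency question the paper studies.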
Submission Number: 42