Sample Efficient Preference Alignment in LLMs via Active Exploration

Published: 08 Jul 2025, Last Modified: 26 Aug 2025 · COLM 2025 · CC BY 4.0
Keywords: Sample Efficient, DPO, RLHF, Alignment, Active learning, LLMs
TL;DR: We propose an exploration-based approach to active learning for RLHF, addressing both online and offline settings.
Abstract: Preference-based feedback is important for many applications in machine learning where evaluation of a reward function is not feasible. Notable recent examples arise in preference alignment for large language models, including reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO). For many applications of preference alignment, the cost of acquiring human feedback can be substantial. In this work, we take advantage of the fact that one can often choose the contexts at which to obtain human feedback so as to most efficiently identify a good policy, and we formalize the setting as an \emph{active contextual dueling bandit} problem. We propose an active exploration algorithm to efficiently select the data and prove a polynomial worst-case regret bound for it. We then extend the setting and methodology to practical preference alignment of large language models, providing both an online and an offline approach. With limited samples of human preferences, our method outperforms the baselines on several language models and four real-world datasets, including two new datasets that we contribute to the literature.
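To make the active-selection idea concrete, below is a minimal illustrative sketch, not the paper's algorithm: it assumes an ensemble of reward models whose disagreement on candidate prompts serves as an uncertainty score, and only the most uncertain prompts are sent for human preference labeling. All names (`ensemble_rewards`, `select_prompts_for_feedback`) and the disagreement heuristic are hypothetical stand-ins.

```python
# Illustrative sketch only -- not the authors' method. Assumes ensemble
# disagreement as an uncertainty proxy for choosing which prompts
# (contexts) to query for human preference feedback.
import numpy as np

rng = np.random.default_rng(0)

def ensemble_rewards(prompt_features, n_models=4):
    """Stand-in for an ensemble of learned reward models.

    Each 'model' is a random linear scorer here; in practice these would
    be reward heads trained on the preference data gathered so far.
    """
    weights = rng.normal(size=(n_models, prompt_features.shape[1]))
    return prompt_features @ weights.T          # shape: (n_prompts, n_models)

def select_prompts_for_feedback(prompt_features, budget):
    """Pick the prompts whose reward estimates disagree most across the ensemble."""
    scores = ensemble_rewards(prompt_features)
    uncertainty = scores.std(axis=1)            # disagreement as an epistemic proxy
    return np.argsort(-uncertainty)[:budget]    # most uncertain prompts first

# Usage: 100 candidate prompts with 8-dim features; label only the top 10.
features = rng.normal(size=(100, 8))
chosen = select_prompts_for_feedback(features, budget=10)
print("Query human preferences on prompts:", chosen)
```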
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html
Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html
Submission Number: 1456