Self-Exploring Language Models: Active Preference Elicitation for Online Alignment

Shenao Zhang; Donghan Yu; Hiteshi Sharma; Ziyi Yang; Shuohang Wang; Hany Hassan Awadalla; Zhaoran Wang

Published: 17 Jun 2024, Last Modified: 17 Jun 2024. Venue: AutoRL@ICML 2024. License: CC BY 4.0

Keywords: Alignment, Large Language Models, RLHF

Abstract: Preference optimization, particularly through Reinforcement Learning from Human Feedback (RLHF), has achieved significant success in aligning Large Language Models (LLMs) to adhere to human intentions. Unlike offline alignment with a fixed dataset, online feedback collection from humans or AI on model generations typically leads to more capable reward models and better-aligned LLMs through an iterative process. However, achieving a globally accurate reward model requires systematic exploration to generate diverse responses that span the vast space of natural language. Random sampling from standard reward-maximizing LLMs alone is insufficient to fulfill this requirement. To address this issue, we propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions. By solving the inner-level problem with the reparameterized reward function, the resulting algorithm, named Self-Exploring Language Models (SELM), eliminates the need for a separate reward model (RM) and iteratively updates the LLM with a straightforward objective. Compared to Direct Preference Optimization (DPO), the SELM objective reduces indiscriminate favor of unseen extrapolations and enhances exploration efficiency. Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, SELM significantly boosts the performance on instruction-following benchmarks such as MT-Bench and AlpacaEval 2.0, as well as various standard academic benchmarks in different settings.
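
For readers unfamiliar with the DPO baseline the abstract compares against, the following is a minimal sketch of the standard per-example DPO loss, built on the reparameterized reward r(x, y) = beta * log(pi(y|x) / pi_ref(y|x)). It is not the SELM objective, whose exact form the abstract does not state. The function name `dpo_loss`, its argument names, and the default `beta=0.1` are illustrative assumptions, and the inputs are assumed to be summed sequence log-probabilities.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard per-example DPO loss (illustrative sketch, not the SELM objective).

    All four inputs are assumed to be sequence log-probabilities of the chosen
    and rejected responses under the policy and the frozen reference model.
    """
    # Implicit rewards under the reparameterization r = beta * log(pi / pi_ref).
    reward_chosen = beta * (logp_chosen - ref_logp_chosen)
    reward_rejected = beta * (logp_rejected - ref_logp_rejected)
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(margin)), computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss shrinks as the policy raises the chosen response's log-probability relative to the reference model and lowers the rejected one's. Per the abstract, SELM modifies this kind of objective with an optimism bias to encourage exploration.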

Supplementary Material: zip

Submission Number: 30
