SAIL: Self-improving Efficient Online Alignment of Large Language Models

Published: 18 Jun 2024, Last Modified: 03 Jul 2024, TF2M 2024 Poster, CC BY 4.0
Keywords: RLHF, Alignment, Iterative Alignment
TL;DR: Online Adaptive Direct Preference Optimization
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a key method for aligning large language models (LLMs) with human preferences. Offline RLHF methods rely on fixed preference datasets, which can lead to suboptimal performance, while existing online RLHF methods lack a unified conceptual formulation and suffer from distribution shift. We establish that online LLM alignment is underpinned by bilevel optimization. By reducing this formulation to an efficient single-level first-order method (using the reward-policy equivalence), our approach generates new samples and iteratively refines model alignment. We thus perform alignment in an online, self-improving manner and recover prior online RLHF methods as special cases. We significantly improve alignment performance on open-source datasets with minimal computational overhead.
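To make the abstract's claim concrete, the following is a minimal illustrative sketch of what a bilevel view of online alignment and its single-level reduction might look like; the notation (reward model $r_\phi$, policy $\pi_\theta$, reference policy $\pi_{\mathrm{ref}}$, temperature $\beta$) is assumed here for exposition and is not taken from the paper itself.

```latex
% Illustrative bilevel formulation of online alignment (notation assumed, not from the paper).
% Outer level: fit the reward to preferences over pairs sampled from the aligned policy
% (with labels from a human/oracle annotator); inner level: the aligned policy is the
% KL-regularized maximizer of that reward.
\begin{align}
\min_{\phi}\;& \mathbb{E}_{x \sim \mathcal{D},\; (y_w, y_l) \sim \pi^{*}_{\phi}(\cdot \mid x)}
  \Big[ -\log \sigma\big( r_{\phi}(x, y_w) - r_{\phi}(x, y_l) \big) \Big] \\
\text{s.t.}\;& \pi^{*}_{\phi} \in \arg\max_{\pi}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[ r_{\phi}(x, y) \big]
  - \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[ \pi \,\|\, \pi_{\mathrm{ref}} \big].
\end{align}
% Substituting the reward-policy equivalence
%   r_{\phi}(x, y) = \beta \log \frac{\pi^{*}_{\phi}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
% eliminates the inner problem, yielding a single-level, DPO-style first-order objective
% over policy parameters whose preference pairs are regenerated from the current policy
% at each iteration (the online, self-improving loop described in the abstract).
```

Under this reading, freezing the sampling distribution to a fixed dataset recovers offline DPO as a special case, which is one way to interpret the abstract's claim of generalizing prior methods.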
Submission Number: 59