Keywords: RLHF, Alignment, Iterative Alignment
TL;DR: Online Adaptive Direct Preference Optimization
Abstract: Reinforcement Learning from Human Feedback (RLHF) is a key method for aligning large language models (LLMs) with human preferences. Offline RLHF methods rely on fixed preference datasets, which can lead to sub-optimal performance, while existing online RLHF methods lack a unified conceptual formulation and suffer from distribution shift. We establish that online LLM alignment is underpinned by bilevel optimization. By reducing this formulation to an efficient single-level first-order method via the reward-policy equivalence, our approach generates new samples and iteratively refines model alignment. We thus perform alignment in an online, self-improving manner and recover prior online RLHF methods as special cases. We significantly improve alignment performance on open-source datasets with minimal computational overhead.
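The reward-policy equivalence mentioned in the abstract is the DPO-style identity in which the implicit reward is beta times the log-ratio of the policy to a reference policy. A minimal sketch of a single first-order update on that single-level objective is shown below; this is an illustrative reconstruction, not the authors' implementation, and the toy log-probability tensors are placeholders for per-response sums computed by an actual LLM.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO-style loss: the implicit reward beta * (log pi - log pi_ref)
    encodes the reward-policy equivalence referenced in the abstract."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy per-response log-probabilities standing in for model outputs on
# freshly sampled (chosen, rejected) pairs in an online iteration.
policy_chosen = torch.tensor([-12.3, -10.1], requires_grad=True)
policy_rejected = torch.tensor([-11.8, -9.7], requires_grad=True)
ref_chosen = torch.tensor([-12.5, -10.4])
ref_rejected = torch.tensor([-11.5, -9.9])

loss = dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected)
loss.backward()  # one first-order step on the single-level objective
```

In an online loop, the policy would repeatedly generate new response pairs, obtain preference labels, and apply this update, which is the iterative self-improving refinement the abstract describes.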
Submission Number: 59