TL;DR: This paper introduces and analyzes an alternative to direct preference optimization that does not rely on an implicit reward.
Abstract: Large language models (LLMs) are often fine-tuned to align their generated responses with human preferences through a process called reinforcement learning from human feedback (RLHF). As RLHF relies on a challenging training sequence, whereby a separate reward model is independently learned and then later applied to LLM policy updates, ongoing research efforts have targeted more straightforward alternatives. In this regard, direct preference optimization (DPO) and its many offshoots circumvent the need for a separate reward training step. Instead, through the judicious use of a reparameterization trick that induces an implicit reward, DPO and related methods consolidate learning to the minimization of a single loss function. Yet despite demonstrable success in some real-world settings, we prove that DPO-based objectives are nonetheless subject to sub-optimal regularization and counter-intuitive interpolation behaviors, underappreciated artifacts of the reparameterizations upon which they are based. To address these shortcomings, we introduce an explicit preference optimization framework termed EXPO that requires no analogous reparameterization to induce an implicit reward. Instead, we posit intuitively appealing regularization factors from scratch that transparently avoid the potential pitfalls of key DPO variants, provably satisfying regularization desiderata that prior methods do not. Empirical results serve to corroborate our analyses and showcase the efficacy of EXPO.
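For context on the reparameterization the abstract refers to, a minimal sketch of the standard DPO loss (from prior work, not this submission's EXPO objective, whose regularization factors are not spelled out on this page) is

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log \sigma\!\left(\beta \log \tfrac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} \;-\; \beta \log \tfrac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],

where (x, y_w, y_l) are preference pairs, \pi_{\mathrm{ref}} is a fixed reference policy, and the reparameterization assigns the implicit reward \hat{r}(x,y) = \beta \log \tfrac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)} (up to a prompt-dependent constant). The abstract's claims concern artifacts of this implicit-reward construction, which EXPO avoids by specifying its regularization terms explicitly.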
Lay Summary: Large language models are often fine-tuned to match human preferences by first training a separate reward model and then using it to guide subsequent refinement steps, a process that can be complex, unstable, and slow. Direct preference optimization (DPO) simplifies this pipeline by using a mathematical reparameterization to create an implicit reward without requiring a separate explicit reward model; however, this shortcut can introduce unintended, counterintuitive behaviors and degeneracies. To address these issues, we propose EXPO, which directly forms the training objective from intuitive penalty factors, with no dependency on subtle reparameterizations. This transparent approach avoids some of DPO's shortcomings and, in empirical testing, matches or surpasses its performance in aligning models with human preferences.
Link To Code: https://github.com/lmkong020/explicit-preference-optimization
Primary Area: Deep Learning->Large Language Models
Keywords: direct preference optimization, reinforcement learning from human feedback, preference alignment, regularized regression
Submission Number: 15908