Unavoidable Learning Constraints Alter the Foundations of Direct Preference Optimization

Published: 18 Jun 2024, Last Modified: 17 Jul 2024 | TF2M 2024 Poster | License: CC BY 4.0
Keywords: direct preference optimization, reinforcement learning with human feedback
TL;DR: We demonstrate how the introduction of constraints during learning can impact the foundations and interpretation of DPO.
Abstract: Large language models have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when implementing these RLHF pipelines, various reparameterization techniques have recently been introduced to sidestep the need for separately learning an RL reward model. Instead, direct fine-tuning for human preferences is achieved via the minimization of a single closed-form training objective, a process originally referred to as direct preference optimization (DPO). Although effective in certain real-world settings, we detail how the foundational DPO reparameterization no longer holds once inevitable learning constraints are introduced during model training. This then motivates alternative derivations and analyses of DPO that remain intact even in the presence of such constraints. As initial steps in this direction, we re-derive DPO from a simple Gaussian estimation perspective, with strong ties to classical constrained optimization problems involving noise-adaptive, concave regularization.
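For context, the closed-form training objective referenced in the abstract is the standard DPO loss from the original DPO formulation; the notation below is the conventional one and is not drawn from this submission. Given a preference dataset $\mathcal{D}$ of prompts $x$ with preferred and dispreferred responses $y_w, y_l$, a reference policy $\pi_{\mathrm{ref}}$, and a temperature $\beta$,

\[
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) \;=\; -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right],
\]

which arises from reparameterizing the reward as $r(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)$ in the KL-regularized RLHF objective. It is precisely this reparameterization step that the abstract argues no longer holds exactly once learning constraints are introduced during training.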
Submission Number: 46