Common Learning Constraints Alter Interpretations of Direct Preference Optimization

Published: 22 Jan 2025, Last Modified: 11 Mar 2025
AISTATS 2025 Poster, CC BY 4.0
TL;DR: We demonstrate that unavoidable learning constraints interfere with the original motivation for direct preference optimization (DPO), and subsequently derive DPO from a new perspective.
Abstract: Large language models have typically relied on some form of reinforcement learning with human feedback (RLHF) to better align model responses with human preferences. However, because of oft-observed instabilities when implementing these RLHF pipelines, various reparameterization techniques have recently been introduced to sidestep the need for separately learning an RL reward model. Instead, fine-tuning directly for human preferences is achieved by minimizing a single closed-form training objective, a process originally referred to as direct preference optimization (DPO). Although effective in certain real-world settings, we detail how the foundational role of DPO reparameterizations (and their equivalence to applying RLHF with an optimal reward) may be obfuscated once inevitable optimization constraints are introduced during model training. This motivates alternative derivations and analyses of DPO that remain intact even in the presence of such constraints. As initial steps in this direction, we re-derive DPO from a simple Gaussian estimation perspective, with strong ties to compressive sensing and classical constrained optimization problems involving noise-adaptive, concave regularization.
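For context, the single closed-form training objective referenced in the abstract is the standard DPO loss of Rafailov et al. (2023); a sketch in the usual notation, where $\pi_\theta$ denotes the policy being fine-tuned, $\pi_{\mathrm{ref}}$ a fixed reference policy, $\beta > 0$ a temperature hyperparameter, $\sigma$ the logistic sigmoid, and $(x, y_w, y_l)$ a prompt paired with preferred and dispreferred responses, is:

% Standard DPO objective, shown here only as background for the abstract above.
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]

The abstract's argument concerns how the reparameterization behind this objective, and its claimed equivalence to RLHF with an optimal reward, behaves once practical optimization constraints are imposed; the paper's alternative Gaussian estimation derivation is not reproduced here.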
Submission Number: 740