Keywords: alignment, DPO, preferences, pessimism, RLHF, LLMs
TL;DR: We propose a robust objective for offline preference alignment in which an LM's implicit reward model is trained, via a distillation loss, to match the scores of an explicit reward model.
Abstract: Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Recently, Direct Preference Optimization (DPO) has gained popularity as an offline alignment method that is supervised directly by human preference annotations. However, DPO is overconfident in the preference annotations, implicitly assigning them rewards of infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probability of the preferred output to go to zero. In this work, we propose to use distillation to combat overconfidence: we train the LM to match the reward distribution induced by a model trained on the preference data. Moreover, to account for uncertainty in the reward model we are distilling from, we optimize against a family of reward models, which may be instantiated either implicitly or explicitly. Our results show that both measures lead to improved robustness to distribution shift in preference annotations, while preserving the supervised nature of DPO.
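To make the distillation idea concrete, below is a minimal, hypothetical sketch of what a soft-label preference loss of this kind could look like: the policy's DPO-style implicit reward margin is pushed toward the preference probability implied by an explicit reward model's scores, instead of toward a hard 0/1 label. The exact objective of the paper is not given on this page, so the loss form, the function and parameter names (`reward_distillation_loss`, `beta`, `temperature`), and the use of a Bradley-Terry soft target are all assumptions for illustration only.

```python
# Hypothetical sketch: distilling an explicit reward model's scores into the
# policy's implicit (DPO-style) rewards. Not the paper's exact objective.
import torch
import torch.nn.functional as F


def implicit_rewards(policy_logps, ref_logps, beta=0.1):
    """DPO-style implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (policy_logps - ref_logps)


def reward_distillation_loss(policy_logps_w, policy_logps_l,
                             ref_logps_w, ref_logps_l,
                             rm_scores_w, rm_scores_l,
                             beta=0.1, temperature=1.0):
    """Soft-label preference loss (assumed form): match the policy's implied
    preference probability to the one implied by an explicit reward model,
    rather than to a hard 0/1 annotation as in standard DPO."""
    # Implicit reward margin between preferred (w) and dispreferred (l) responses.
    margin_policy = (implicit_rewards(policy_logps_w, ref_logps_w, beta)
                     - implicit_rewards(policy_logps_l, ref_logps_l, beta))
    # Soft target from the explicit reward model's score margin (Bradley-Terry).
    target_prob = torch.sigmoid((rm_scores_w - rm_scores_l) / temperature).detach()
    # Cross-entropy between the two preference distributions; finite targets
    # avoid the infinite-magnitude rewards implied by hard labels.
    return F.binary_cross_entropy_with_logits(margin_policy, target_prob)


if __name__ == "__main__":
    # Toy usage with per-sequence log-probabilities and reward scores.
    b = 4
    loss = reward_distillation_loss(
        policy_logps_w=torch.randn(b), policy_logps_l=torch.randn(b),
        ref_logps_w=torch.randn(b), ref_logps_l=torch.randn(b),
        rm_scores_w=torch.randn(b), rm_scores_l=torch.randn(b),
    )
    print(loss.item())
```

Under these assumptions, setting the target probability to exactly 1 would recover the standard DPO loss, which is one way to see how distillation from a finite-score reward model tempers DPO's overconfidence.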
Primary Area: Natural language processing
Submission Number: 20267