OUTLIER-AWARE PREFERENCE OPTIMIZATION FOR LARGE LANGUAGE MODELS

Published: 05 Mar 2025, Last Modified: 01 Apr 2025 · Poster · CC BY 4.0
Keywords: OOD, Preference Optimization, DPO, Alignment
TL;DR: Dynamic alignment of LLMs using energy-based OOD detection to refine policy and reward models with targeted oracle feedback.
Abstract: Aligning large language models (LLMs) to user preferences often relies on learning a reward model from feedback as a proxy for human judgment. However, such reward models can fail on out-of-distribution examples and, if kept static, may reinforce incorrect preferences. We propose a dynamic alignment method that uses an energy-based out-of-distribution (OOD) scoring mechanism to identify potential misjudgments, then judiciously collects oracle feedback to refine both the policy and the reward model. By focusing on these OOD examples, our approach iteratively improves alignment and robustness in preference-based training. Empirically, we show that our method enhances the policy model's generative capabilities on MT-Bench and improves the reward model's judgment accuracy on RewardBench.
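The abstract's core mechanism is energy-based OOD scoring over the reward model's outputs. Below is a minimal sketch of how such a score is commonly computed, assuming the reward model exposes per-candidate logits; the function names (`energy_score`, `select_for_oracle_feedback`), tensor shapes, and thresholding scheme are illustrative assumptions, not the paper's actual implementation.

```python
import torch


def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Standard energy-based OOD score: E(x) = -T * logsumexp(f(x) / T).

    Higher (less negative) energy typically indicates an out-of-distribution
    input. `logits` is assumed to have shape (batch, num_candidates),
    e.g. reward-model scores over candidate responses; this interface is a
    sketch, not the paper's API.
    """
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)


def select_for_oracle_feedback(logits: torch.Tensor, threshold: float) -> torch.Tensor:
    """Flag examples whose energy exceeds a threshold as likely reward-model
    misjudgments; in the described pipeline, such examples would be routed to
    the oracle for fresh preference labels used to refine both models."""
    return energy_score(logits) > threshold
```

In this sketch, the threshold would be chosen on held-in validation data (for example, a high percentile of in-distribution energies), so that only the most suspicious examples trigger the comparatively expensive oracle queries.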
Submission Number: 39