Keywords: Reward Modeling; OOD detection; DPO; Alignment
TL;DR: We dynamically refine LLM alignment by detecting OOD reward model failures and selectively gathering oracle feedback to improve robustness and performance.
Abstract: Aligning large language models (LLMs) to user preferences often relies on learning a reward model as a proxy from feedback. However, such reward models can fail on out-of-distribution examples and, if kept static, may reinforce incorrect preferences. We propose a dynamic alignment method that uses an energy-based out-of-distribution (OOD) scoring mechanism to identify potential misjudgments, then judiciously collects oracle feedback to refine both the policy and the reward model. By focusing on OOD examples, our approach iteratively improves alignment and robustness in preference-based training. Empirically, we show that our method enhances the policy model's generative capabilities on the LM Eval Harness benchmark and improves the reward model's judgment capability on RewardBench.
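To make the OOD scoring step concrete, below is a minimal sketch of energy-based OOD detection applied to reward-model logits. The abstract does not specify the exact scoring formulation, so this assumes the standard energy score E(x) = -T * logsumexp(f(x)/T); the function names (`energy_score`, `flag_ood_examples`), the per-pair logit shape, and the threshold value are illustrative assumptions, not the paper's implementation.

```python
import torch


def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Standard energy-based OOD score: E(x) = -T * logsumexp(f(x) / T).

    Lower energy typically indicates in-distribution inputs; higher energy
    suggests the reward model may be unreliable (OOD) on this example.
    """
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)


def flag_ood_examples(logits: torch.Tensor, threshold: float) -> torch.Tensor:
    """Boolean mask of examples whose energy exceeds the threshold,
    i.e. candidates for collecting oracle (human/ground-truth) feedback."""
    return energy_score(logits) > threshold


if __name__ == "__main__":
    # Hypothetical reward-model logits for a batch of 8 preference pairs
    # (e.g. one logit per candidate response).
    batch_logits = torch.randn(8, 2)
    mask = flag_ood_examples(batch_logits, threshold=-0.5)
    print("Route to oracle feedback:", mask.tolist())
```

In a dynamic alignment loop of the kind the abstract describes, only the flagged examples would be sent for oracle labeling, and the resulting feedback would be used to update both the reward model and the policy.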
Submission Type: Long Paper (9 Pages)
Archival Option: This is an archival submission
Presentation Venue Preference: ICLR 2025
Submission Number: 76