Keywords: Reward Modeling; OOD detection; DPO; Alignment
TL;DR: We dynamically refine LLM alignment by detecting OOD reward model failures and selectively gathering oracle feedback to improve robustness and performance.
Abstract: Aligning large language models (LLMs) to user preferences often relies on learning a reward model as a proxy from feedback. However, such reward models can fail on out-of-distribution examples and, if kept static, may reinforce incorrect preferences. We propose a dynamic alignment method that uses an energy-based out-of-distribution (OOD) scoring mechanism to identify potential misjudgments, then judiciously collects oracle feedback to refine both the policy and the reward model. By focusing on OOD examples, our approach iteratively improves alignment and robustness in preference-based training. Empirically, we show that our method enhances the policy model's generative capabilities on the LM Eval Harness benchmark and improves the reward model's judgment capability on RewardBench.
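To make the OOD scoring step concrete, below is a minimal sketch of energy-based OOD detection applied to reward-model logits. The abstract does not specify the exact scoring formulation, so this assumes the standard energy score E(x) = -T * logsumexp(f(x)/T); the function names (`energy_score`, `flag_ood_examples`), the per-pair logit shape, and the threshold value are illustrative assumptions, not the paper's implementation.

```python
import torch


def energy_score(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Standard energy-based OOD score: E(x) = -T * logsumexp(f(x) / T).

    Lower energy typically indicates in-distribution inputs; higher energy
    suggests the reward model may be unreliable (OOD) on this example.
    """
    return -temperature * torch.logsumexp(logits / temperature, dim=-1)


def flag_ood_examples(logits: torch.Tensor, threshold: float) -> torch.Tensor:
    """Boolean mask of examples whose energy exceeds the threshold,
    i.e. candidates for collecting oracle (human/ground-truth) feedback."""
    return energy_score(logits) > threshold


if __name__ == "__main__":
    # Hypothetical reward-model logits for a batch of 8 preference pairs
    # (e.g. one logit per candidate response).
    batch_logits = torch.randn(8, 2)
    mask = flag_ood_examples(batch_logits, threshold=-0.5)
    print("Route to oracle feedback:", mask.tolist())
```

In a dynamic alignment loop of the kind the abstract describes, only the flagged examples would be sent for oracle labeling, and the resulting feedback would be used to update both the reward model and the policy.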
Submission Type: Long Paper (9 Pages)
Archival Option: This is an archival submission
Presentation Venue Preference: ICLR 2025
Submission Number: 76