Abstract: Traditional RLHF-based LLM alignment methods and direct alignment counterparts like DPO assume a Bradley-Terry model of pairwise preferences. This assumption is challenged by non-deterministic or noisy preference labels, such as those assigned to two candidate outputs with low confidence or a small reward difference. This paper introduces **DRDO (Direct Reward Distillation and policy-Optimization)**, which simultaneously models rewards *and* preferences to avoid such degeneracies. DRDO directly mimics rewards assigned by an oracle while learning human preferences through a novel preference likelihood formulation, and it remains fully *offline*. Results on Ultrafeedback, TL;DR, and AlpacaEval 2.0 show that DRDO-trained policies surpass methods such as DPO and e-DPO in expected rewards and are more robust to noisy preference signals and out-of-distribution (OOD) settings.
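As an illustration only (the paper's exact objective is not reproduced in this abstract), a DRDO-style loss might combine a reward-distillation term that matches the oracle's reward margin with a pairwise preference-likelihood term; the function name, argument names, and weighting `alpha` below are all assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def drdo_style_loss(policy_chosen_logps: torch.Tensor,
                    policy_rejected_logps: torch.Tensor,
                    policy_reward_chosen: torch.Tensor,
                    policy_reward_rejected: torch.Tensor,
                    oracle_reward_chosen: torch.Tensor,
                    oracle_reward_rejected: torch.Tensor,
                    alpha: float = 1.0) -> torch.Tensor:
    """Sketch of a combined objective: (1) a reward-distillation term that
    regresses the policy's reward margin toward the oracle's margin, and
    (2) a pairwise preference-likelihood term on the chosen/rejected pair.
    All names and the weighting `alpha` are illustrative assumptions."""
    # (1) Distillation: squared error between policy and oracle reward margins.
    distill = (
        (policy_reward_chosen - policy_reward_rejected)
        - (oracle_reward_chosen - oracle_reward_rejected)
    ).pow(2).mean()

    # (2) Preference likelihood: Bradley-Terry-style negative log-likelihood
    # that the chosen response is preferred under the policy's log-probabilities.
    pref_nll = -F.logsigmoid(policy_chosen_logps - policy_rejected_logps).mean()

    return distill + alpha * pref_nll
```

Under this sketch, the distillation term supplies a dense reward signal from the oracle even when the preference label is noisy, while the preference term behaves like a standard pairwise log-likelihood; the actual DRDO formulation may differ.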
Paper Type: Long
Research Area: Generation
Research Area Keywords: LLM alignment, preference modeling, reward modeling
Contribution Types: NLP engineering experiment, Theory
Languages Studied: English
Submission Number: 4833