Abstract: Traditional RLHF-based LLM alignment methods and direct alignment counterparts like DPO assume a Bradley-Terry model of pairwise preferences. This assumption is challenged by non-deterministic or noisy preference labels, such as those assigned to two candidate outputs with low confidence or a small reward difference. This paper introduces **DRDO (Direct Reward Distillation and policy-Optimization)**, which simultaneously models rewards *and* preferences to avoid such degeneracies. DRDO directly mimics rewards assigned by an oracle while learning human preferences through a novel preference likelihood formulation, and it remains fully *offline*. Results on Ultrafeedback, TL;DR, and AlpacaEval 2.0 show that DRDO-trained policies surpass methods such as DPO and e-DPO in expected rewards and are more robust to noisy preference signals and out-of-distribution (OOD) settings.
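As an illustration only (the paper's exact objective is not reproduced in this abstract), a DRDO-style loss might combine a reward-distillation term that matches the oracle's reward margin with a pairwise preference-likelihood term; the function name, argument names, and weighting `alpha` below are all assumptions, not the authors' formulation.

```python
import torch
import torch.nn.functional as F

def drdo_style_loss(policy_chosen_logps: torch.Tensor,
                    policy_rejected_logps: torch.Tensor,
                    policy_reward_chosen: torch.Tensor,
                    policy_reward_rejected: torch.Tensor,
                    oracle_reward_chosen: torch.Tensor,
                    oracle_reward_rejected: torch.Tensor,
                    alpha: float = 1.0) -> torch.Tensor:
    """Sketch of a combined objective: (1) a reward-distillation term that
    regresses the policy's reward margin toward the oracle's margin, and
    (2) a pairwise preference-likelihood term on the chosen/rejected pair.
    All names and the weighting `alpha` are illustrative assumptions."""
    # (1) Distillation: squared error between policy and oracle reward margins.
    distill = (
        (policy_reward_chosen - policy_reward_rejected)
        - (oracle_reward_chosen - oracle_reward_rejected)
    ).pow(2).mean()

    # (2) Preference likelihood: Bradley-Terry-style negative log-likelihood
    # that the chosen response is preferred under the policy's log-probabilities.
    pref_nll = -F.logsigmoid(policy_chosen_logps - policy_rejected_logps).mean()

    return distill + alpha * pref_nll
```

Under this sketch, the distillation term supplies a dense reward signal from the oracle even when the preference label is noisy, while the preference term behaves like a standard pairwise log-likelihood; the actual DRDO formulation may differ.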
Paper Type: Long
Research Area: Generation
Research Area Keywords: LLM alignment, preference modeling, reward modeling
Contribution Types: NLP engineering experiment, Theory
Languages Studied: English
Submission Number: 4833