Keywords: Preference, RLHF, DPO, LLM
Abstract: Direct Preference Optimization (DPO) is widely studied and used for preference alignment. However, when offline datasets are sparse, limited, imbalanced, or noisy due to constrained collection processes, DPO may suffer performance degradation. Inspired by the pessimism principle in offline learning, we propose Robust DPO (rDPO), a pessimistic preference-optimization framework that accounts for dataset uncertainty by optimizing against the worst-case latent reward within a data-dependent uncertainty set. We show that rDPO enjoys a simple structure and can be fine-tuned directly from vanilla DPO policies. Moreover, we theoretically prove the effectiveness of rDPO, showing that it learns a policy robust to dataset uncertainty. We further verify empirically that rDPO improves robustness both in controlled synthetic environments with sparse or noisy comparisons and in language-model preference tuning under targeted corruption.
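To make the pessimism idea concrete, below is a minimal PyTorch sketch of a pessimistic DPO-style loss. The abstract does not specify the form of the uncertainty set, so the per-pair `uncertainty` term, the penalty weight `kappa`, and the name `pessimistic_dpo_loss` are illustrative assumptions rather than the paper's actual objective: the implicit DPO reward margin is simply shrunk by an uncertainty penalty before the standard log-sigmoid.

```python
import torch
import torch.nn.functional as F

def pessimistic_dpo_loss(policy_logp_w, policy_logp_l,
                         ref_logp_w, ref_logp_l,
                         uncertainty, beta=0.1, kappa=1.0):
    # Implicit DPO reward margin between chosen (w) and rejected (l) responses.
    margin = beta * ((policy_logp_w - ref_logp_w)
                     - (policy_logp_l - ref_logp_l))
    # Pessimism (assumed form): shrink the margin by a data-dependent
    # uncertainty penalty, approximating the worst-case latent reward
    # inside an uncertainty set of width proportional to `uncertainty`.
    return -F.logsigmoid(margin - kappa * uncertainty).mean()

# Toy usage with random per-pair log-probabilities and uncertainty widths.
lp_w, lp_l, ref_w, ref_l = (torch.randn(8) for _ in range(4))
loss = pessimistic_dpo_loss(lp_w, lp_l, ref_w, ref_l, torch.rand(8))
```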
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: reinforcement learning in agents
Contribution Types: Theory
Languages Studied: English
Submission Number: 10359