Keywords: Preference, RLHF, DPO, LLM
Abstract: Direct Preference Optimization (DPO) is widely studied and used for preference alignment. However, when offline datasets are sparse, limited, imbalanced, or noisy due to constrained collection processes, DPO may suffer performance degradation. Inspired by the pessimism principle in offline learning, we propose Robust DPO (rDPO), a pessimistic preference-optimization framework that accounts for dataset uncertainty by optimizing against the worst-case latent reward within a data-dependent uncertainty set. We show that rDPO enjoys a simple structure and can be fine-tuned directly from vanilla DPO policies. Moreover, we theoretically prove the effectiveness of rDPO, showing that it learns a policy robust to dataset uncertainty. We further verify empirically that rDPO improves robustness both in controlled synthetic environments with sparse or noisy comparisons and in language-model preference tuning under targeted corruption.
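To make the pessimism idea concrete, below is a minimal PyTorch sketch of a pessimistic DPO-style loss. The abstract does not specify the form of the uncertainty set, so the per-pair `uncertainty` term, the penalty weight `kappa`, and the name `pessimistic_dpo_loss` are illustrative assumptions rather than the paper's actual objective: the implicit DPO reward margin is simply shrunk by an uncertainty penalty before the standard log-sigmoid.

```python
import torch
import torch.nn.functional as F

def pessimistic_dpo_loss(policy_logp_w, policy_logp_l,
                         ref_logp_w, ref_logp_l,
                         uncertainty, beta=0.1, kappa=1.0):
    # Implicit DPO reward margin between chosen (w) and rejected (l) responses.
    margin = beta * ((policy_logp_w - ref_logp_w)
                     - (policy_logp_l - ref_logp_l))
    # Pessimism (assumed form): shrink the margin by a data-dependent
    # uncertainty penalty, approximating the worst-case latent reward
    # inside an uncertainty set of width proportional to `uncertainty`.
    return -F.logsigmoid(margin - kappa * uncertainty).mean()

# Toy usage with random per-pair log-probabilities and uncertainty widths.
lp_w, lp_l, ref_w, ref_l = (torch.randn(8) for _ in range(4))
loss = pessimistic_dpo_loss(lp_w, lp_l, ref_w, ref_l, torch.rand(8))
```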
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: reinforcement learning in agents
Contribution Types: Theory
Languages Studied: English
Submission Number: 10359