Robust Direct Preference Optimization via Variational-Form $f$-Divergence

16 Sept 2025 (modified: 05 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Direct Preference Optimization, $f$-divergence, aligning with preference noise
TL;DR: We investigate the robustness of certain variational-form $f$-divergences when learning from noisy preference text data.
Abstract: Direct Preference Optimization (DPO) is commonly used to align Large Language Models (LLMs) with human preferences, yet it suffers when the preference annotations are noisy. Existing robust approaches often require knowledge of the transition between clean and noisy preferences, or rely on additional architectures or models to correct noisy preference labels. In this work, we investigate when an $f$-divergence objective is immune to imperfect preference annotations, by maximizing the $f$-divergence between the noisy preferred and unpreferred data distributions. Theoretically, we show that when the noise ratio is known, the Total Variation formulation can serve as a surrogate for training on the clean dataset. In contrast, the Jensen–Shannon formulation is invariant to noise, yielding identical results under noisy and clean preferences even without knowledge of the noise rate. Empirically, the variational form of the Jensen–Shannon divergence enhances the model's ability to generate preferred responses under noisy conditions, while simultaneously improving the factual accuracy of its outputs.
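For concreteness, the variational ("Fenchel-dual") form referenced above can be sketched using the standard $f$-GAN-style construction; the symbols $p_w$, $p_l$ (preferred and unpreferred response distributions) and the critic $T_\theta$ are illustrative placeholders, not notation taken from the paper:
\[
  D_f\!\left(p_w \,\Vert\, p_l\right)
  \;\ge\;
  \sup_{T_\theta}\;
  \mathbb{E}_{y \sim p_w}\!\bigl[T_\theta(y)\bigr]
  \;-\;
  \mathbb{E}_{y \sim p_l}\!\bigl[f^{*}\!\bigl(T_\theta(y)\bigr)\bigr],
\]
where $f^{*}$ is the convex conjugate of $f$. The Total Variation instance uses $f(u) = \tfrac{1}{2}|u-1|$ with $f^{*}(t) = t$ on $|t| \le \tfrac{1}{2}$, and the Jensen–Shannon instance uses $f(u) = u\log u - (u+1)\log\tfrac{u+1}{2}$ with $f^{*}(t) = -\log(2 - e^{t})$ for $t < \log 2$. Maximizing the right-hand side over $T_\theta$ tightens the bound; the paper's claims concern how this objective behaves when $p_w$ and $p_l$ are corrupted by label noise.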
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 7995