Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

ICLR 2026 Conference Submission 6206 Authors

Published: 26 Jan 2026, Last Modified: 26 Jan 2026 · ICLR 2026 · CC BY 4.0
Keywords: Diffusion Model
TL;DR: A semi-supervised approach to Diffusion DPO (RLHF for diffusion models) that treats preference pairs with conflicting multi-dimensional signals as noisy unlabeled data and pseudo-labels them with the model itself.
Abstract: Human visual preferences are inherently multi-dimensional, encompassing aspects of aesthetics, detail fidelity, and semantic alignment. However, existing open-source preference datasets provide only single, holistic annotations, resulting in severe label noise: images that excel in some dimensions (e.g., composition) but are deficient in others (e.g., detail fidelity) are simply marked as "winner" or "loser". We theoretically demonstrate that this compression of multi-dimensional preferences into binary labels generates conflicting gradient signals that misguide the optimization process in Diffusion Direct Preference Optimization (DPO). To address this label noise arising from conflicting multi-dimensional preferences, we propose Semi-DPO, a semi-supervised learning approach. We treat pairs with consistent preferences across all dimensions as clean labeled data, and pairs with conflicting signals as noisy unlabeled data. Our method first trains a model on the clean, consensus-filtered subset. This model then acts as its own implicit classifier to generate pseudo-labels for the larger, noisy set, which are used to iteratively refine the model's alignment. This approach effectively mitigates label noise and enhances image generation quality. Experimental results demonstrate that Semi-DPO significantly improves alignment with multi-dimensional human preferences, achieving state-of-the-art performance without additional human annotation or the need to train a dedicated reward model.
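The abstract describes a two-stage pipeline: consensus-filter the pairs into clean and noisy subsets, train on the clean subset with the standard DPO pairwise loss, then let the partially trained model pseudo-label the noisy pairs via its implicit reward. The sketch below illustrates that idea only at a schematic level; it is not the authors' implementation, and the helpers `model.log_prob`, the per-dimension `scorers`, and `pair.flipped()` are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def dpo_pair_loss(model, ref_model, pair, beta=0.1):
    """Pairwise DPO-style loss: log-sigmoid of the implicit reward margin
    between the preferred (img_w) and dispreferred (img_l) image.
    `log_prob` is an assumed helper returning a (diffusion-ELBO surrogate)
    log-likelihood of an image given a prompt."""
    logr_w = model.log_prob(pair.prompt, pair.img_w) - ref_model.log_prob(pair.prompt, pair.img_w)
    logr_l = model.log_prob(pair.prompt, pair.img_l) - ref_model.log_prob(pair.prompt, pair.img_l)
    return -F.logsigmoid(beta * (logr_w - logr_l))

def split_by_consensus(pairs, scorers):
    """Clean set: every per-dimension scorer prefers img_w over img_l.
    Noisy set: at least one dimension conflicts with the holistic label."""
    clean, noisy = [], []
    for pair in pairs:
        margins = [s(pair.prompt, pair.img_w) - s(pair.prompt, pair.img_l) for s in scorers]
        (clean if all(m > 0 for m in margins) else noisy).append(pair)
    return clean, noisy

def pseudo_label(model, ref_model, pair, beta=0.1, tau=0.0):
    """Use the model as its own implicit classifier: keep the pair's
    orientation if its implicit reward margin is positive, flip it if
    negative, and abstain (return None) when the margin is too small."""
    with torch.no_grad():
        margin = beta * (
            (model.log_prob(pair.prompt, pair.img_w) - ref_model.log_prob(pair.prompt, pair.img_w))
            - (model.log_prob(pair.prompt, pair.img_l) - ref_model.log_prob(pair.prompt, pair.img_l))
        )
    if margin.abs() <= tau:
        return None
    return pair if margin > 0 else pair.flipped()
```

Under these assumptions, training would alternate between optimizing `dpo_pair_loss` on the clean subset and on the pseudo-labeled pairs returned by `pseudo_label`, iteratively refining the model's alignment as described in the abstract.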
Primary Area: generative models
Submission Number: 6206