Improvement-Guided Iterative DPO for Diffusion Models

Published: 10 Jun 2025, Last Modified: 30 Jun 2025, MoFA Poster, CC BY 4.0
Keywords: diffusion models, DPO, RLHF, AI-generated feedback
Abstract: Direct Preference Optimization (DPO) has proven effective for aligning generative models with human preferences. Recent analyses show, however, that DPO's performance is constrained by the offline preference dataset. To address this limitation, this paper introduces a novel improvement-guided approach for online iterative optimization of diffusion models without extra annotation. We propose learning an improvement model that extracts the implicit preference-improvement direction from the preference dataset. The learned improvement model is then used to generate winning images from the images produced by the current diffusion model, which serve as losing images. By repeatedly generating such online preference datasets, the improvement model guides iterative DPO. This method enables online improvement beyond offline DPO training without requiring additional human labeling or risking overfitting to a reward model. Results demonstrate improved preference alignment with higher diversity compared with other fine-tuning methods. Our work bridges the gap between offline preference learning and online improvement, offering a promising direction for enhancing diffusion models in image generation tasks with limited preference data.
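The abstract describes an iterative loop: sample losing images from the current diffusion model, let the improvement model turn them into winning images, and fine-tune with DPO on the resulting pairs. The sketch below illustrates that loop only; the function and method names (`generate`, `improve`, `dpo_update`) are hypothetical placeholders, not the authors' implementation or API.

```python
def improvement_guided_iterative_dpo(diffusion_model, improvement_model,
                                     prompts, num_rounds, dpo_update):
    """Hypothetical sketch of the improvement-guided iterative DPO loop."""
    for _ in range(num_rounds):
        # Losing images: samples from the current diffusion model.
        losing = [diffusion_model.generate(p) for p in prompts]

        # Winning images: the improvement model refines each losing image
        # along the preference-improvement direction learned offline.
        winning = [improvement_model.improve(p, x)
                   for p, x in zip(prompts, losing)]

        # Online preference dataset for this round: (prompt, win, lose) triples.
        online_pairs = list(zip(prompts, winning, losing))

        # One round of Diffusion-DPO fine-tuning on the online pairs
        # (dpo_update stands in for a standard Diffusion-DPO training step).
        diffusion_model = dpo_update(diffusion_model, online_pairs)
    return diffusion_model
```

Because the winning images are produced by the improvement model rather than by human raters or a learned reward model, each round yields fresh preference pairs without extra annotation or reward hacking.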
Submission Number: 34