D-Fusion: Direct Preference Optimization for Aligning Diffusion Models with Visually Consistent Samples
TL;DR: This paper introduces D-Fusion, a compatible approach that constructs DPO-trainable, visually consistent samples to further enhance the prompt-image alignment of diffusion models trained with reinforcement learning.
Abstract: The practical applications of diffusion models have been limited by the misalignment between generated images and their corresponding text prompts. Recent studies have introduced direct preference optimization (DPO) to enhance the alignment of these models. However, the effectiveness of DPO is constrained by the issue of visual inconsistency, where the significant visual disparity between well-aligned and poorly-aligned images prevents diffusion models from identifying which factors contribute positively to alignment during fine-tuning. To address this issue, this paper introduces D-Fusion, a method to construct DPO-trainable, visually consistent samples. On one hand, by performing mask-guided self-attention fusion, the resulting images are not only well-aligned, but also visually consistent with the given poorly-aligned images. On the other hand, D-Fusion can retain the denoising trajectories of the resulting images, which are essential for DPO training. Extensive experiments demonstrate the effectiveness of D-Fusion in improving prompt-image alignment when applied to different reinforcement learning algorithms.
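To make the core operation concrete, below is a minimal, illustrative sketch of what a mask-guided self-attention fusion step could look like. This is not the authors' code: the tensor shapes, the choice of which keys/values come from the reference versus the target denoising pass, and the simple masked-swap fusion rule are all assumptions made for illustration; the actual D-Fusion procedure (including which layers and timesteps are fused, and how denoising trajectories are retained for DPO) follows the paper.

```python
# Illustrative sketch only: a generic masked self-attention fusion step.
import torch

def masked_self_attention_fusion(q_tgt, k_tgt, v_tgt, k_ref, v_ref, mask):
    """Fuse reference self-attention keys/values into the target pass.

    q_tgt, k_tgt, v_tgt: (batch, tokens, dim) self-attention tensors of the
        image currently being denoised.
    k_ref, v_ref:        (batch, tokens, dim) tensors cached from a reference
        denoising pass.
    mask:                (batch, tokens, 1) binary mask; 1 marks spatial
        tokens whose appearance should follow the reference image.
    """
    # Swap in reference keys/values only inside the masked region, so the
    # unmasked region keeps the target pass's own attention statistics.
    k = mask * k_ref + (1.0 - mask) * k_tgt
    v = mask * v_ref + (1.0 - mask) * v_tgt

    scale = q_tgt.shape[-1] ** -0.5
    attn = torch.softmax(q_tgt @ k.transpose(-2, -1) * scale, dim=-1)
    return attn @ v


# Toy usage with random tensors (shapes chosen arbitrarily).
b, n, d = 1, 64, 320
q = torch.randn(b, n, d)
k_t, v_t = torch.randn(b, n, d), torch.randn(b, n, d)
k_r, v_r = torch.randn(b, n, d), torch.randn(b, n, d)
m = (torch.rand(b, n, 1) > 0.5).float()
out = masked_self_attention_fusion(q, k_t, v_t, k_r, v_r, m)
print(out.shape)  # torch.Size([1, 64, 320])
```

In this reading, the masked swap is what lets the fused image inherit appearance from one denoising pass while the surrounding context, and hence the prompt alignment, is governed by the other; the specific pairing of passes is the paper's contribution, not something this sketch pins down.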
Lay Summary: Diffusion models are powerful tools for generating images from text, but they often produce images that don't quite match what users ask for. A recent technique called Direct Preference Optimization (DPO) tries to fix this by teaching the model what users prefer. However, it struggles to learn effectively when the training images look too different from each other, since it is hard to figure out which changes lead to better matches between images and text. Our research proposes a new method called D-Fusion to solve this problem. D-Fusion creates pairs of training images that are more visually similar but differ in how well they match the text prompt. It also keeps track of important intermediate states when creating images, which are needed by DPO. By applying DPO with these images, the model can effectively learn how to generate images that match the given text. This work brings us closer to building image generation tools that follow instructions more faithfully.
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Diffusion Models, Alignment, Reinforcement Learning
Submission Number: 2703