Dual Caption Preference Optimization for Diffusion Models

Published: 15 Oct 2025, Last Modified: 15 Oct 2025 · Accepted by TMLR · CC BY 4.0
Abstract: Recent advances in human preference optimization, originally developed for Large Language Models (LLMs), have shown significant potential for improving text-to-image diffusion models. These methods aim to learn the distribution of preferred samples while distinguishing them from less preferred ones. However, in existing preference datasets, the original caption often does not clearly favor the preferred image over the alternative, which weakens the supervision signal available during training. To address this issue, we introduce Dual Caption Preference Optimization (DCPO), a data augmentation and optimization framework that reinforces the learning signal by assigning two distinct captions to each preference pair, encouraging the model to better differentiate between preferred and less-preferred outcomes during training. We also construct Pick-Double Caption, a modified version of Pick-a-Pic v2 with a separate caption for each image, and propose three strategies for generating distinct captions: captioning, perturbation, and a hybrid method. Our experiments show that DCPO significantly improves image quality and relevance to prompts, outperforming Stable Diffusion (SD) 2.1, SFT_Chosen, Diffusion-DPO, and MaPO across multiple metrics, including PickScore, HPSv2.1, GenEval, CLIPScore, and ImageReward, with SD 2.1 as the backbone.
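To make the idea concrete, the sketch below shows one way a dual-caption preference objective could look: a Diffusion-DPO-style ε-prediction loss in which the preferred and less-preferred images are each conditioned on their *own* caption rather than a shared one. This is a minimal illustration under assumptions, not the paper's released implementation; the function names (`dual_caption_dpo_loss`, `eps_error`), the diffusers-style UNet call, and the default `beta` are all hypothetical.

```python
# Minimal sketch (assumed, not the authors' code): a Diffusion-DPO-style
# preference loss where each image in a pair is conditioned on its OWN caption,
# in the spirit of DCPO's dual-caption setup.
import torch
import torch.nn.functional as F

def dual_caption_dpo_loss(model, ref_model, noisy_w, noisy_l, noise, t,
                          emb_caption_w, emb_caption_l, beta=1000.0):
    """Loss for one (preferred, less-preferred) pair at timestep t.

    noisy_w / noisy_l: noised latents of the preferred / less-preferred image
    emb_caption_w / emb_caption_l: embeddings of the two distinct captions
    """
    def eps_error(net, noisy, emb):
        # Squared epsilon-prediction error, assuming a diffusers-style UNet
        # that returns an object with a `.sample` field.
        pred = net(noisy, t, encoder_hidden_states=emb).sample
        return F.mse_loss(pred, noise, reduction="none").mean(dim=(1, 2, 3))

    # Errors under the trainable model: each image paired with its own caption.
    err_w = eps_error(model, noisy_w, emb_caption_w)
    err_l = eps_error(model, noisy_l, emb_caption_l)

    # Same quantities under the frozen reference model.
    with torch.no_grad():
        ref_err_w = eps_error(ref_model, noisy_w, emb_caption_w)
        ref_err_l = eps_error(ref_model, noisy_l, emb_caption_l)

    # DPO-style objective: reward a lower relative error on the preferred image.
    diff = (err_w - ref_err_w) - (err_l - ref_err_l)
    return -F.logsigmoid(-beta * diff).mean()
```

Relative to standard Diffusion-DPO, the only structural change in this sketch is that `emb_caption_w` and `emb_caption_l` differ, which is the augmented supervision signal the abstract describes.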
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission:
1. **Clarification of conflicting distributions:** We revised the abstract, introduction, and challenge sections to justify and clarify the notion of "conflicting distributions." We also refined the presentation to better highlight our method as a form of data augmentation with stronger performance, in line with the reviewers' suggestion.
2. **Perturbation selection process:** We added more detailed explanations in the perturbation section to clearly describe the process.
3. **Additional experimental validation:** We expanded the experimental section to include (i) hyperparameter tuning, (ii) a human study demonstrating the alignment between GPT-4o judgments and human evaluations, (iii) evaluations using an alternative MLLM-as-a-judge for controllable perturbation, and (iv) experiments on both online and iterative methods. Each of these additions is accompanied by extended explanations to enhance clarity.
4. **Expanded appendix:** We provide detailed descriptions and results for each experiment in the appendix to ensure full transparency and reproducibility.
Code: https://github.com/sahsaeedi/DCPO/
Supplementary Material: zip
Assigned Action Editor: ~Jia-Bin_Huang1
Submission Number: 4940