Abstract: Fine-tuning techniques such as Direct Preference Optimization (DPO) allow one to better align Large Language Models (LLMs) with human preferences. The recent adaptation of DPO to diffusion models and its derivative works have proven effective at improving visual appeal and prompt-image alignment. However, these works fine-tune on preference datasets labeled by human annotators, which are inherently subjective and prone to noisy labels. We hypothesize that fine-tuning on these subjective preferences does not lead to optimal model alignment. To address this, we develop a quality metric to rank image preference pairs and achieve more effective Diffusion-DPO fine-tuning. We fine-tune using incremental subsets of this ranked dataset and show that diffusion models fine-tuned using only the top 5.33% of the data perform better, both quantitatively and qualitatively, than models fine-tuned on the full dataset. Furthermore, we leverage this quality metric and our diverse prompt selection strategy to synthesize a new paired preference dataset. We show that fine-tuning on this new dataset achieves better results than models trained on human-labeled datasets. The code is available at https://anonymous.4open.science/r/DPO-QSD-28D7/README.md.
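A minimal sketch of the selection idea described in the abstract, not the authors' released code: preference pairs are ranked by a quality metric and only the top fraction is kept for Diffusion-DPO fine-tuning. The names `PreferencePair`, `quality_score`, and `rank_and_filter` are illustrative assumptions; the metric's actual definition is given in the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PreferencePair:
    prompt: str
    preferred_image: str   # identifier of the "chosen" image
    rejected_image: str    # identifier of the "rejected" image


def rank_and_filter(
    pairs: List[PreferencePair],
    quality_score: Callable[[PreferencePair], float],  # assumed quality metric
    keep_fraction: float = 0.0533,                      # e.g. the top 5.33%
) -> List[PreferencePair]:
    """Sort pairs by the quality score (descending) and keep the top slice."""
    ranked = sorted(pairs, key=quality_score, reverse=True)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]
```

The filtered subset would then replace the full human-labeled dataset in an otherwise standard Diffusion-DPO training loop.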
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Lu_Jiang1
Submission Number: 5995