Smoothed Preference Optimization via ReNoise Inversion for Aligning Diffusion Models with Varied Human Preferences
TL;DR: An efficient method designed to align diffusion models with varied human preferences.
Abstract: Direct Preference Optimization (DPO) aligns text-to-image (T2I) generation models with human preferences using pairwise preference data. Although substantial resources are spent collecting and labeling these datasets, a critical aspect is often neglected: *preferences vary across individuals and should be represented with more granularity.* To address this, we propose SmPO-Diffusion, a novel method that models preference distributions to improve the DPO objective, together with a numerical upper-bound estimation for the diffusion optimization objective. First, we introduce a smoothed preference distribution to replace the original binary distribution: a reward model simulates human preferences, and preference likelihood averaging refines the DPO loss so that it approaches zero when the two samples are preferred almost equally. Second, we use an inversion technique to simulate the trajectory preference distribution of the diffusion model, enabling more accurate alignment with the optimization objective. Through these straightforward modifications, our approach mitigates the over-optimization and objective-misalignment issues present in existing methods. Experimental results demonstrate that our method achieves state-of-the-art performance on preference evaluation tasks, surpassing baselines across various metrics while reducing training costs.
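To make the core idea concrete, the following is a minimal PyTorch sketch of a smoothed DPO-style objective in which the binary preference label is replaced by a soft preference probability obtained from a reward model. The function name, hyperparameter values, and the specific (2p - 1) weighting are illustrative assumptions chosen to exhibit the property that the loss vanishes when the two samples are preferred almost equally; they are not the exact SmPO-Diffusion formulation (see the linked code for the actual implementation).

```python
# Illustrative sketch of a smoothed, soft-preference DPO-style loss.
# Names and hyperparameters are hypothetical, not taken from the paper.
import torch
import torch.nn.functional as F

def smoothed_dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                      reward_w, reward_l, beta=2000.0, tau=1.0):
    """Soft-preference DPO-style loss.

    logp_w / logp_l: model log-likelihood terms for the preferred /
        dispreferred samples (for diffusion models, per-step denoising
        losses are typically used as a surrogate).
    ref_logp_w / ref_logp_l: the same quantities under the frozen
        reference model.
    reward_w / reward_l: scores from a reward model that simulates
        human raters; samples are assumed ordered so reward_w >= reward_l.
    """
    # Smoothed preference probability via a Bradley-Terry model of the
    # reward gap (lies in [0.5, 1] under the ordering assumption).
    p = torch.sigmoid((reward_w - reward_l) / tau)

    # Implicit reward margin, as in standard (Diffusion-)DPO.
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

    # Weight the DPO term by the preference strength (2p - 1): a clear
    # preference recovers the usual -log(sigmoid(margin)) term, while a
    # near-tie (p ~ 0.5) drives the loss toward zero.
    return -((2.0 * p - 1.0) * F.logsigmoid(margin)).mean()
```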
Lay Summary: (1) When training AI systems to generate images using "thumbs up/down" feedback, current methods treat everyone’s preferences as identical—like asking people to rate art as only "good" or "bad," ignoring the rich diversity of human taste. (2) We created SmPO-Diffusion, a new technique that captures subtle differences in preferences. Imagine a "virtual jury" that mimics how different people might score an image. We then teach the AI to balance these preferences mathematically, similar to a chef refining a recipe based on diners’ varied feedback. Additionally, we developed a way to track the AI’s creative decisions step-by-step, ensuring it aligns better with what users truly want. (3) Experiments show our method boosts image quality ratings by 16.7%, cuts training time by 79.8%, and generates more creative outputs.
Link To Code: https://github.com/JaydenLyh/SmPO
Primary Area: Applications->Computer Vision
Keywords: diffusion models, human preference, preference optimization
Submission Number: 3130