SIPO: Stabilized and Improved Preference Optimization for Aligning Diffusion Models

Xiaomeng Yang; Mengping Yang; Junyan Wang; Zhijian Zhou; Zhiyu Tan; Hao Li

SIPO: Stabilized and Improved Preference Optimization for Aligning Diffusion Models

Xiaomeng Yang, Mengping Yang, Junyan Wang, Zhijian Zhou, Zhiyu Tan, Hao Li

20 Sept 2025 (modified: 11 Feb 2026)Submitted to ICLR 2026EveryoneRevisionsBibTeXCC BY 4.0

Keywords: Diffusion, DPO, image generate, video generate

TL;DR: We propose a stabilized and improved preference optimization framework for aligning diffusion generative models with human perferences.

Abstract: Preference learning has garnered extensive attention as an effective technique for aligning diffusion models with human preferences in visual generation tasks. However, existing alignment approaches such as Diffusion-DPO suffer from two fundamental challenges: training instability caused by high gradient variances at various timesteps and high parameter sensitivities, and off-policy bias arising from the discrepancy between the optimization data and the policy model's distribution. Our first contribution is a systematical analysis of the diffusion trajectories across different timesteps and identify that the instability primarily originates from early timesteps with low importance weights. To address these issues, we propose SIPO, a Stabilized and Improved preference Optimization framework for aligning diffusion models with human preferences. Concretely, a key gradient, \emph{i.e.,} DPO-C\&M is introduced to facilitate stabilize training by clipping and masking uninformative timesteps. Followed by a timestep aware importance re-weighting paradigm to fully correct off-policy bias and emphasize informative updates throughout the alignment process. Extensive experiments on various baseline models, including image generation models on SD1.5, SDXL, and video generation models CogVideoX-2B, CogVideoX-5B, and Wan2.1-1.3B, demonstrate that our SIPO consistently promotes stabilized training and outperforms existing alignment methods, with meticulous adjustments on parameters. Overall, these results highlight the importance of timestep-aware alignment and and provide valuable guidelines for improved preference optimization in diffusion models.

Supplementary Material: pdf

Primary Area: generative models

Submission Number: 23238

Loading