Keywords: Video Generation; Post-training; Diffusion Models
Abstract: Recent preference alignment strategies have gained traction in large language models (LLMs) and are now being extended to broader generative domains. Approaches such as Direct Preference Optimization have been adapted to diffusion models by leveraging human-labeled preferences or auxiliary score models to distinguish ``winners'' from ``losers''.
However, these methods face two key challenges: (1) the optimization process often overfits to the score model, resulting in suboptimal generation quality; and (2) results generated from the same text prompt diverge significantly, yielding limited effective gradients and reduced training efficiency. These limitations are further exacerbated in video generation, where evaluation is more complex and inference is slower. In this work, we introduce Self-Discriminative Optimization, which uses only a handful of real samples to unlock markedly higher-quality generation. First, we introduce a self-degradation procedure that applies frequency-domain reweighting to the latent representations of real samples, yielding degraded samples that more closely match the model's original output distribution. This produces controlled distortions such as low quality, temporal inconsistency, and object deformation.
We then use these real/degraded pairs as positive and negative examples to fine-tune the pretrained model discriminatively with automatically assigned, reliable labels.
By exploiting the richer gradients from these controllable degradation pairs, our experiments demonstrate substantial gains in structural quality and semantic alignment using only a handful of high-quality samples and minimal fine-tuning.
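The self-degradation step above can be illustrated with a minimal sketch. The function name, the simple radial low/high-frequency split, and the attenuation parameters are all hypothetical stand-ins (the paper's actual reweighting scheme is not specified here); the sketch only shows the general idea of damping high spatial frequencies in a latent to produce a controlled negative sample:

```python
import numpy as np

def self_degrade(latent, alpha=0.5, cutoff=0.25):
    """Hypothetical sketch of frequency-domain reweighting.

    latent: array of shape (C, H, W), e.g. a VAE latent of one frame.
    alpha:  attenuation factor applied to high-frequency components.
    cutoff: normalized radial frequency separating low from high bands.
    """
    C, H, W = latent.shape
    # Per-channel 2-D FFT, shifted so the zero frequency is centered.
    spec = np.fft.fftshift(np.fft.fft2(latent, axes=(-2, -1)), axes=(-2, -1))
    # Normalized radial frequency grid over the (H, W) plane.
    fy = np.fft.fftshift(np.fft.fftfreq(H))
    fx = np.fft.fftshift(np.fft.fftfreq(W))
    r = np.sqrt(fy[:, None] ** 2 + fx[None, :] ** 2)
    # Reweighting mask: keep low frequencies, damp high ones by alpha.
    mask = np.where(r <= cutoff, 1.0, alpha)
    spec = spec * mask
    # Invert the transform to obtain the degraded latent.
    spec = np.fft.ifftshift(spec, axes=(-2, -1))
    return np.real(np.fft.ifft2(spec, axes=(-2, -1)))
```

Pairing each real latent with its degraded counterpart then provides the automatically labeled positive/negative examples used for discriminative fine-tuning.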
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 9992