Direct Discriminative Optimization: Your Likelihood-Based Visual Generative Model is Secretly a GAN Discriminator

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Spotlight Poster · CC BY-NC-SA 4.0
TL;DR: an efficient and effective finetuning method for enhancing diffusion models and visual autoregressive models
Abstract: While likelihood-based generative models, particularly diffusion and autoregressive models, have achieved remarkable fidelity in visual generation, the maximum likelihood estimation (MLE) objective, which minimizes the forward KL divergence, inherently suffers from a mode-covering tendency that limits the generation quality under limited model capacity. In this work, we propose Direct Discriminative Optimization (DDO) as a unified framework that integrates likelihood-based generative training and GAN-type discrimination to bypass this fundamental constraint by exploiting reverse KL and self-generated negative signals. Our key insight is to parameterize a discriminator implicitly using the likelihood ratio between a learnable target model and a fixed reference model, drawing parallels with the philosophy of Direct Preference Optimization (DPO). Unlike GANs, this parameterization eliminates the need for joint training of generator and discriminator networks, allowing for direct, efficient, and effective finetuning of a well-trained model to its full potential beyond the limits of MLE. DDO can be performed iteratively in a self-play manner for progressive model refinement, with each round requiring less than 1\% of pretraining epochs. Our experiments demonstrate the effectiveness of DDO by significantly advancing the previous SOTA diffusion model EDM, reducing FID scores from 1.79/1.58/1.96 to new records of 1.30/0.97/1.26 on CIFAR-10/ImageNet-64/ImageNet 512$\times$512 datasets without any guidance mechanisms, and by consistently improving both guidance-free and CFG-enhanced FIDs of visual autoregressive models on ImageNet 256$\times$256.
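For concreteness, the display below sketches one plausible instantiation of the objective implied by the abstract: the discriminator logit is the log-likelihood ratio between the learnable target model $p_\theta$ and the frozen reference model $p_{\mathrm{ref}}$, plugged into a standard logistic GAN discrimination loss with real data as positives and self-generated reference samples as negatives. The symbols $p_\theta$, $p_{\mathrm{ref}}$, $d_\theta$, and the sigmoid $\sigma$ are notational assumptions introduced here for illustration; the paper and the linked code repository give the exact formulation.
$$
d_\theta(x) = \log \frac{p_\theta(x)}{p_{\mathrm{ref}}(x)}, \qquad
\mathcal{L}_{\mathrm{DDO}}(\theta) = -\,\mathbb{E}_{x \sim p_{\mathrm{data}}}\!\big[\log \sigma\big(d_\theta(x)\big)\big] \;-\; \mathbb{E}_{x \sim p_{\mathrm{ref}}}\!\big[\log \sigma\big(-d_\theta(x)\big)\big].
$$
Under this reading, minimizing the loss pushes $p_\theta$ to assign relatively higher likelihood to real data and lower likelihood to the reference model's own samples, which is the "likelihood model as GAN discriminator" view; resetting $p_{\mathrm{ref}}$ to the latest $p_\theta$ between rounds corresponds to the self-play refinement described above.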
Lay Summary: Modern AI image generators can produce stunning visuals, but the way they are trained still has a key weakness: they tend to play it safe by trying to cover all possible outputs, which leads to blurry or less realistic images when the model size is limited. This happens because the most common training method (called maximum likelihood estimation) encourages covering every possibility rather than focusing on the most likely or high-quality results. To overcome this, we introduce a new approach called Direct Discriminative Optimization (DDO). It improves training by helping the model learn from its own mistakes — identifying and correcting low-quality outputs — without needing a separate "judge" model like in GANs. Inspired by recent techniques in AI alignment, DDO works by comparing a model to a fixed reference and using that difference as a learning signal. DDO can upgrade existing models efficiently and drastically improve image quality. In our experiments, it helped leading AI models produce sharper, more realistic images across several standard benchmarks — outperforming previous best results without needing extra tricks or more training data.
Link To Code: https://github.com/NVlabs/DDO
Primary Area: Deep Learning->Generative Models and Autoencoders
Keywords: Diffusion Models, Visual Autoregressive Models, GAN, Generation Quality
Submission Number: 5007