Elucidating Guidance in Variance Exploding Diffusion Models: Fast Convergence and Better Diversity

ICLR 2026 Conference Submission 16784 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Diffusion Models, Guidance-based Method, Convergence Analysis
Abstract: Recently, conditional diffusion models have shown impressive performance in many areas, such as text-to-image, 3D, and video generation. To achieve better alignment with a given condition, guidance-based methods have been proposed and have become a standard component of diffusion models. Although guidance-based methods are widely used, existing theoretical guarantees mainly focus on variance-preserving (VP) models and do not cover the current state-of-the-art (SOTA) variance-exploding (VE) models for conditional generation. In this work, for the first time, we elucidate the influence of guidance on VE models and explain why VE-based models outperform VP models in the context of Gaussian mixture models, from the perspectives of classification confidence and diversity. For classification confidence, we prove that the convergence rate of the confidence with respect to the guidance strength $\eta$ is $1-\eta^{-1}(\log \eta)^2$ for VE models, which is faster than the $1-\eta^{-e^{-T}}(\log \eta)^{2 e^{-T}}$ rate for VP models ($T$ is the diffusion time). This result indicates that VE models have a stronger ability to align with the given condition, which is important for conditional generation. For diversity, previous work shows that under strong guidance, VP models tend to generate extreme samples and suffer from mode collapse. In contrast, we show that since the forward process of VE models preserves the multi-modal structure of the data, VE models are better able to avoid mode collapse under strong guidance. Simulations and real-world experiments also support our theoretical results.
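The gap between the two convergence rates can be made concrete numerically. The sketch below evaluates the two confidence lower bounds stated in the abstract; the functional forms come from the abstract itself, while the specific choices of $\eta$ and $T$ are illustrative assumptions, not values from the paper.

```python
import math

def ve_confidence(eta: float) -> float:
    """VE bound from the abstract: confidence >= 1 - eta^{-1} (log eta)^2."""
    return 1.0 - (math.log(eta) ** 2) / eta

def vp_confidence(eta: float, T: float) -> float:
    """VP bound from the abstract: confidence >= 1 - eta^{-e^{-T}} (log eta)^{2 e^{-T}}.

    The exponent e^{-T} shrinks with the diffusion time T, so the bound
    approaches 1 far more slowly in the guidance strength eta.
    """
    a = math.exp(-T)  # effective exponent; decays with diffusion time
    return 1.0 - (eta ** (-a)) * (math.log(eta) ** (2 * a))

# Illustrative regime (assumed values): strong guidance, moderate diffusion time.
eta, T = 1e6, 5.0
print(f"VE confidence bound: {ve_confidence(eta):.6f}")
print(f"VP confidence bound: {vp_confidence(eta, T):.6f}")
```

At this (assumed) operating point the VE bound is already close to 1 while the VP bound is still far from it, matching the abstract's claim that VE models align with the condition faster as guidance strength grows.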
Primary Area: generative models
Submission Number: 16784