Keywords: Diffusion
Abstract: Classifier-free guidance (CFG) is a fundamental technique for flow-based models, significantly enhancing visual quality and prompt adherence.
However, the guidance scale is typically tuned empirically because higher values are unstable and often induce visual artifacts and mode collapse.
This paper investigates the underlying mechanisms driving this instability and proposes an effective solution.
Our analysis reveals that high CFG scales induce a detrimental distribution shift in the velocity prediction, which degrades generation fidelity.
To address this, we introduce TCG, a novel plug-and-play method comprising two key components: (1) Moment Matching (MM), which stabilizes the velocity distribution by aligning its first two moments (mean and variance), thereby preventing mode collapse; and (2) Adaptive Clipping (AdapC), which dynamically constrains the guidance update term from both temporal and spatial perspectives to ensure smooth and stable sampling.
As a result, our method enables robust and high-quality generation across a wide range of guidance scales.
Extensive experiments on diverse text-to-image and text-to-video benchmarks validate that our method outperforms both standard CFG and its state-of-the-art variants.
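
To make the two components named in the abstract concrete, the following is a minimal NumPy sketch of one guided velocity step. It is not the paper's implementation. The abstract does not specify the reference distribution for Moment Matching, the axes over which moments are computed, or the clipping rule, so every one of those choices here is an assumption made for illustration: moments are matched to the conditional velocity over all elements, the guidance update is clipped by its L2 norm over the channel axis at each spatial location, and the clip threshold shrinks linearly with a time value `t` in [0, 1]. The names `match_moments`, `clip_update`, `guided_velocity`, and `base_clip` are hypothetical.

```python
import numpy as np

EPS = 1e-8  # guards against division by zero


def match_moments(v, v_ref):
    """Shift and rescale v so its mean and std match those of v_ref.

    Assumption: moments are taken over all elements of a single sample.
    """
    centered = (v - v.mean()) / (v.std() + EPS)
    return centered * v_ref.std() + v_ref.mean()


def clip_update(delta, max_norm):
    """Limit the guidance update at each spatial location.

    Assumption: delta has shape (channels, height, width) and the
    L2 norm over the channel axis is capped at max_norm.
    """
    norms = np.linalg.norm(delta, axis=0, keepdims=True)
    return delta * np.minimum(1.0, max_norm / (norms + EPS))


def guided_velocity(v_uncond, v_cond, scale, t, base_clip=1.0):
    """One CFG step with clipping and moment matching (illustrative only)."""
    # Hypothetical time-dependent threshold; the paper's schedule may differ.
    max_norm = base_clip * (1.0 - 0.5 * t)
    # Standard CFG update term, clipped before it is applied.
    delta = clip_update(scale * (v_cond - v_uncond), max_norm)
    v = v_uncond + delta
    # Assumed reference for the first two moments: the conditional velocity.
    return match_moments(v, v_cond)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v_u = rng.normal(size=(4, 8, 8))
    v_c = rng.normal(size=(4, 8, 8))
    v = guided_velocity(v_u, v_c, scale=15.0, t=0.5)
    print(v.shape)
```

In this sketch the clipping bounds how far a large guidance scale can push any single location, while the moment matching keeps the overall mean and variance of the guided velocity tied to a reference, which is one plausible reading of how the two parts would counteract the distribution shift described above.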
Supplementary Material: zip
Primary Area: generative models
Submission Number: 2078