Keywords: Classifier-Free Guidance, Diffusion Guidance, Guidance Distillation, Text-to-Image Synthesis
TL;DR: We train a noise-refining network that refines the initial Gaussian noise so that it encodes an initial layout, enabling high-quality samples without guidance (e.g., classifier-free guidance) during denoising.
Abstract: Diffusion models have demonstrated remarkable image generation capabilities, but their performance heavily relies on sampling guidance such as classifier-free guidance (CFG). While sampling guidance significantly enhances image quality, it requires two forward passes at every denoising step, leading to substantial computational overhead. Existing approaches mitigate this cost through distillation, training a student network to mimic the guided predictions. In contrast, we take a distinct approach by refining the initial Gaussian noise, a critical yet under-explored factor in diffusion-based generation pipelines.
We introduce a noise refinement framework, NoiseRefine, where a refining network is trained to minimize the difference between images generated by unguided sampling from the refined noise and those produced by guided sampling from the input Gaussian noise.
This simple approach demonstrates that images generated from refined noise exhibit fewer artifacts and less structural collapse, achieving significantly higher quality than those generated from pure Gaussian noise. Because the diffusion model itself is left unmodified, its prior knowledge is preserved, along with compatibility with finetuned or timestep-distilled variants.
Beyond its practical benefits, we provide an in-depth analysis of refined noise, offering insights into its role in the denoising process and its interaction with guidance. Our findings suggest that structured noise initialization is key to efficient and high-fidelity image synthesis.
Project page: https://cvlab-kaist.github.io/NoiseRefine/
Primary Area: generative models
Submission Number: 8794