Diffusion-Model Layers May Exhibit Diffusive Behavior at Each Step for Noise Estimation

16 Sept 2025 (modified: 12 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Deep learning, Computer vision, Generative models, Diffusion Models
TL;DR: We hypothesize that diffusion model layers exhibit diffusive behavior, which manifests as an internal, gradual diffusion process at each sampling step.
Abstract: How do diffusion models process inputs at each step? The model transforms the input toward higher noise levels until it reaches a pure-noise prediction. We hypothesize that model layers exhibit diffusive behavior, which manifests as an internal, gradual diffusion process at each sampling step. Based on this insight, we introduce Depth-varying Diffusion (DvD), characterized by two key features: 1) Progressively stacking model layers across sampling steps (i.e., from T to 0) until the depth reaches the baseline. As sampling progresses, the residual noise in the input becomes subtle, requiring more intense diffusion, and thus a deeper model, to map it to pure noise. 2) Enforcing supervision on both intermediate and final outputs. Since stacked layers yield a cascade of “input to intermediate state to pure noise”, we propose that the input should reach a specific intermediate state during the gradual diffusion process. We mathematically derive this intermediate state in DvD and correspondingly use the derived results for supervision. Experimental results demonstrate the effectiveness of these two features, showing improvements in generation quality while also reducing inference cost.
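To make the first feature concrete, the following is a minimal, hypothetical sketch of a depth schedule for DvD. The function name `dvd_depth`, the linear ramp, and the specific `min_depth`/`baseline_depth` values are our assumptions for illustration; the abstract only specifies that depth grows as sampling proceeds from step T toward 0 until it reaches the baseline depth.

```python
# Illustrative sketch only -- NOT the paper's actual schedule.
# Assumption: depth ramps linearly from `min_depth` at step T
# up to `baseline_depth` at step 0.

def dvd_depth(t: int, T: int, min_depth: int, baseline_depth: int) -> int:
    """Number of stacked layers used at sampling step t (t runs from T down to 0)."""
    frac = 1.0 - t / T  # 0.0 at step T, 1.0 at step 0
    return round(min_depth + frac * (baseline_depth - min_depth))

# Depth grows monotonically as sampling proceeds from t = T to t = 0:
schedule = [dvd_depth(t, T=1000, min_depth=4, baseline_depth=12)
            for t in (1000, 750, 500, 250, 0)]
# -> [4, 6, 8, 10, 12]
```

At each sampling step, only the first `dvd_depth(t, ...)` layers would run, so early (high-noise) steps are cheap and the full baseline network is used only near the end, which is consistent with the reported reduction in inference cost.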
Primary Area: generative models
Submission Number: 8012