What Architectural Inductive Bias Makes Diffusion Models Succeed? A Perspective from the Implicit Regularization of Gradient Descent

Published: 26 May 2026, Last Modified: 26 May 2026
Venue: ICML 2026 FoGen Workshop Poster
License: CC BY 4.0
Keywords: diffusion model, inductive bias, implicit regularization
Abstract: Diffusion and flow-based models succeed by training a neural network to predict noise or velocity from corrupted inputs. But why this training succeeds is not fully explained by the denoising objective alone, because the same objective can fail completely when the architecture of the denoiser changes. We study the role of architecture through the lens of gradient dynamics. The key property we identify is sparse connectivity: each neuron receives input from only a small subset of coordinates, a design shared across convolutional and transformer denoisers. We prove that sparse connectivity makes memorization strictly harder than in fully connected networks by shifting the implicit regularization of gradient descent away from the ambient input geometry and onto a collection of low-dimensional patches. Controlled denoising experiments corroborate this theory, and an extension to deep denoisers shows that clean-data prediction keeps internal representations lower-dimensional across layers. Our results point to a concrete mechanism of architectural inductive bias: the architecture determines the geometry on which gradient descent operates, and through this geometry it shapes which solutions training can find.
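As a rough illustration of the sparse-connectivity property described in the abstract, the sketch below builds a single ReLU layer in which each neuron reads only a small contiguous patch of input coordinates. This is not the paper's construction: the 1-D patch layout and the names `patch_mask` and `sparse_layer` are assumptions chosen for illustration. It only shows that a neuron's output cannot depend on coordinates outside its patch, which a fully connected layer does not guarantee.

```python
import numpy as np

def patch_mask(d, patch_size, stride):
    """Boolean mask of shape (n_neurons, d); row i marks the coordinates neuron i may read.

    Hypothetical 1-D patch layout, assumed here only for illustration.
    """
    starts = range(0, d - patch_size + 1, stride)
    mask = np.zeros((len(starts), d), dtype=bool)
    for i, s in enumerate(starts):
        mask[i, s:s + patch_size] = True
    return mask

def sparse_layer(x, W, mask):
    """ReLU layer whose weights are zeroed outside each neuron's patch."""
    return np.maximum((W * mask) @ x, 0.0)

rng = np.random.default_rng(0)
d, patch_size, stride = 16, 4, 2
mask = patch_mask(d, patch_size, stride)   # each neuron sees 4 of 16 coordinates
W = rng.standard_normal(mask.shape)
x = rng.standard_normal(d)

# Perturb the last coordinate, which lies outside neuron 0's patch (coordinates 0..3).
x_perturbed = x.copy()
x_perturbed[-1] += 10.0

out = sparse_layer(x, W, mask)
out_perturbed = sparse_layer(x_perturbed, W, mask)

# Neuron 0 is unaffected; only neurons whose patch covers the last coordinate can change.
print("neuron 0 unchanged:", np.isclose(out[0], out_perturbed[0]))
```

Replacing `mask` with an all-`True` array recovers a fully connected layer, in which every neuron can respond to the perturbed coordinate.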
Submission Number: 229