Keywords: Diffusion Model, Diffusion Transformer, Generalization, Inductive Bias
TL;DR: The generalization of a DiT is shaped by the inductive bias of attention locality rather than by harmonic bases as in UNets. Restricting its attention windows can modify its generalization ability.
Abstract: Recent work studying the generalization of diffusion models with locally linear UNet-based denoisers reveals inductive biases that can be expressed via geometry-adaptive harmonic bases. For such locally linear UNets, these geometry-adaptive harmonic bases can be conveniently visualized through the eigen-decomposition of a UNet’s Jacobian matrix. In practice, however, more recent denoising networks are often transformer-based, e.g., the diffusion transformer (DiT). Due to the presence of nonlinear operations, similar eigen-decomposition analyses cannot be used to reveal the inductive biases of transformer-based denoisers. This motivates our search for alternative ways to explain the strong generalization ability observed in DiT models. Investigating a DiT’s pivotal attention modules, we find that the locality of attention maps in a DiT’s early layers is closely associated with generalization. To verify this finding, we restrict a DiT’s attention windows and observe an improvement in its generalization. Furthermore, we empirically find that both the placement and the effective size of these local attention windows are crucial factors. Experimental results on the CelebA, ImageNet, MSCOCO, and LSUN datasets show that strengthening the inductive bias of a DiT can improve both generalization and generation quality when less training data is available. Source code is available at https://github.com/DiT-Generalization/DiT-Generalization.
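To make the idea of restricting attention windows concrete, below is a minimal sketch (not the authors' implementation; see the linked repository for that) of how a DiT-style attention layer could be limited to a local window over the patch-token grid. Names such as `local_window_mask`, `local_attention`, and `window_size` are illustrative assumptions.

```python
# Minimal sketch: local-window attention over a patch-token grid, assuming
# PyTorch >= 2.0 for scaled_dot_product_attention. Illustrative only.
import torch
import torch.nn.functional as F


def local_window_mask(grid_h, grid_w, window_size, device=None):
    """Boolean mask (True = may attend) limiting each patch token to keys
    within a square window of radius `window_size` on the patch grid."""
    r, c = torch.meshgrid(
        torch.arange(grid_h, device=device),
        torch.arange(grid_w, device=device),
        indexing="ij",
    )
    r, c = r.flatten(), c.flatten()                   # (N,), N = grid_h * grid_w
    dr = (r[:, None] - r[None, :]).abs()
    dc = (c[:, None] - c[None, :]).abs()
    return (dr <= window_size) & (dc <= window_size)  # (N, N)


def local_attention(q, k, v, grid_h, grid_w, window_size):
    """Multi-head attention restricted to a local spatial window.
    q, k, v: (batch, heads, N, head_dim) patch-token projections."""
    mask = local_window_mask(grid_h, grid_w, window_size, device=q.device)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=mask)


if __name__ == "__main__":
    B, H, N, D = 2, 4, 16 * 16, 64                    # 16x16 patch grid
    q, k, v = (torch.randn(B, H, N, D) for _ in range(3))
    out = local_attention(q, k, v, grid_h=16, grid_w=16, window_size=3)
    print(out.shape)                                  # torch.Size([2, 4, 256, 64])
```

In this sketch, applying such a mask only in the early transformer blocks (and choosing `window_size` appropriately) would correspond to the placement and effective-window-size factors discussed in the abstract.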
Supplementary Material: zip
Primary Area: Deep learning (e.g., architectures, generative models, optimization for deep networks, foundation models, LLMs)
Submission Number: 14584