Origins of Creativity in Attention Based Diffusion Models

Published: 09 Jun 2025, Last Modified: 09 Jul 2025 · HiLD at ICML 2025 Poster · CC BY 4.0
Keywords: Diffusion, Attention, Creativity, Memorization, Generalization
TL;DR: We extend existing theory, which explains how CNN-based diffusion models generate creative images, to include attention-based models; our theory predicts that attention enforces global consistency, and we validate this empirically on synthetic data.
Abstract: As diffusion models have become the tool of choice for image generation, and as the quality of generated images continues to improve, the question of how creativity originates in diffusion models has become increasingly important. The score matching perspective on diffusion has proven particularly fruitful for understanding how and why diffusion models generate images that remain visually plausible while differing significantly from their training images. In particular, as explained by Kamb \& Ganguli (2024) and others, e.g., Ambrogioni (2023), theory suggests that an optimal score-matching model would only reproduce training samples through the diffusion process. However, as shown by Kamb \& Ganguli (2024), in diffusion models where the score is parametrized by a simple CNN, the inductive biases of the CNN itself (translation equivariance and locality) allow the model to generate samples that globally match no training sample, but are instead patch-wise `mosaics'. Despite the widespread use of UNet architectures with self-attention as the score backbone in diffusion models, the theoretical role of attention in score networks remains largely unexplored. In this work, we take a preliminary step in this direction by extending the theory to diffusion models whose score is parametrized by a CNN with a final self-attention layer. Our theory predicts that self-attention induces a globally consistent arrangement of local features beyond the patch level in generated samples, and we verify this behavior empirically on a carefully crafted dataset.
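To make the architecture described in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' code) of a score network of the kind studied here: a small, translation-equivariant CNN whose local features are passed through a single final self-attention layer over spatial positions. All layer sizes, the time-embedding scheme, and the module name are illustrative assumptions.

```python
# Minimal sketch: CNN score network with a final self-attention layer.
# Hypothetical illustration only; hyperparameters and structure are assumptions.
import torch
import torch.nn as nn


class CNNWithFinalAttention(nn.Module):
    def __init__(self, in_channels=3, hidden=64, heads=4):
        super().__init__()
        # Local, translation-equivariant feature extractor (the CNN backbone).
        self.cnn = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
        )
        # Simple embedding of the diffusion time / noise level, broadcast over space.
        self.time_mlp = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden)
        )
        # Final self-attention over spatial positions: every local feature can
        # attend to every other, the mechanism conjectured to enforce global
        # consistency of the otherwise patch-wise "mosaic" outputs.
        self.attn = nn.MultiheadAttention(embed_dim=hidden, num_heads=heads, batch_first=True)
        self.out = nn.Conv2d(hidden, in_channels, kernel_size=1)

    def forward(self, x, t):
        # x: (B, C, H, W) noisy image; t: (B,) diffusion times in [0, 1].
        h = self.cnn(x)
        h = h + self.time_mlp(t.view(-1, 1))[:, :, None, None]
        b, c, hh, ww = h.shape
        seq = h.flatten(2).transpose(1, 2)        # (B, H*W, C) token sequence
        attn_out, _ = self.attn(seq, seq, seq)    # global mixing of local features
        h = attn_out.transpose(1, 2).view(b, c, hh, ww)
        return self.out(h)                        # predicted score (or noise)


# Example usage on a batch of noisy 32x32 images:
# score = CNNWithFinalAttention()(torch.randn(8, 3, 32, 32), torch.rand(8))
```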
Student Paper: Yes
Submission Number: 52