How Compositional Generalization and Creativity Improve as Diffusion Models are Trained

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
Abstract: Natural data is often organized as a hierarchical composition of features. How many samples do generative models need in order to learn the composition rules, so as to produce a combinatorially large number of novel samples? What signal in the data is exploited to learn those rules? We investigate these questions in the context of diffusion models, both theoretically and empirically. Theoretically, we consider a simple probabilistic context-free grammar, a tree-like graphical model used to represent the hierarchical and compositional structure of data such as language and images. We demonstrate that diffusion models learn the grammar's composition rules with the sample complexity required for clustering features with statistically similar contexts, a process similar to the word2vec algorithm. However, this clustering emerges hierarchically: higher-level features, associated with longer contexts, require more data to be identified. This mechanism leads to a sample complexity that scales polynomially with context size. As a result, diffusion models trained on datasets of intermediate size generate data that is coherent up to a certain scale but lacks global coherence. We test these predictions across different domains and find remarkable agreement: both generated texts and images achieve progressively larger coherence lengths as the training time or dataset size grows. We discuss connections between the hierarchical clustering mechanism we introduce here and the renormalization group in physics.
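The clustering signal the abstract refers to can be made concrete with a small sketch. The grammar below is hypothetical, a two-level stand-in rather than the paper's construction: a Markov chain over two nonterminals plays the role of the higher level, and each nonterminal expands into one of two synonymous terminal pairs. Counting each pair's neighbor statistics then recovers which nonterminal produced it, the word2vec-like signal described above.

```python
# Minimal sketch, assuming a hypothetical two-level probabilistic grammar
# (not the paper's exact setup): a Markov chain over nonterminals provides
# the higher level; each nonterminal expands into a synonymous terminal pair.
import random

import numpy as np

RULES = {                               # hypothetical production rules
    "A": [("a1", "b1"), ("a2", "b2")],  # two synonymous expansions of A
    "B": [("a3", "b3"), ("a4", "b4")],  # two synonymous expansions of B
}
TRANSITIONS = {                         # non-uniform higher-level statistics,
    "A": {"A": 0.8, "B": 0.2},          # so that context carries signal
    "B": {"A": 0.3, "B": 0.7},
}

def sample_sentence(n_pairs=20):
    """Sample nonterminals from the chain, then expand each into a pair."""
    nts = ["A"]
    for _ in range(n_pairs - 1):
        probs = TRANSITIONS[nts[-1]]
        nts.append(random.choices(list(probs), weights=list(probs.values()))[0])
    return [random.choice(RULES[nt]) for nt in nts]

def context_profiles(sentences):
    """For each terminal pair, the empirical distribution of the next pair."""
    pairs = sorted(p for prods in RULES.values() for p in prods)
    idx = {p: i for i, p in enumerate(pairs)}
    counts = np.zeros((len(pairs), len(pairs)))
    for sent in sentences:
        for left, right in zip(sent, sent[1:]):
            counts[idx[left], idx[right]] += 1
    return pairs, counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)

if __name__ == "__main__":
    random.seed(0)
    pairs, profiles = context_profiles([sample_sentence() for _ in range(2000)])
    # Pairs produced by the same nonterminal share context statistics, so
    # their rows nearly coincide; pairs from different nonterminals differ.
    for p, row in zip(pairs, profiles):
        print(" ".join(p), np.round(row, 2))
```

In this toy setting the two expansions of "A" acquire nearly identical context profiles, distinct from those of "B", so clustering the rows recovers the nonterminals. In the hierarchical setting the abstract describes, the same statistics must be gathered over progressively longer contexts to identify higher-level features, which is why more data is needed at each level.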
Lay Summary: How can AI models learn to be truly creative, composing entirely new combinations of what they've seen before? Humans do this all the time in language and art, but it is not well understood how this ability develops in AI models. In our research, we investigated how powerful generative AI models, known as diffusion models, learn to combine simple building blocks into more complex patterns. We used a simplified model of grammar rules to explore this, and discovered that these AI models learn local patterns first (like the ability to produce short phrases), and only gradually build up the ability to create globally coherent outputs (like full stories or complex images) as they see more data. We confirmed these findings not only in synthetic data but also in real-world settings like image and text generation. Our work helps explain how AI creativity emerges.
Primary Area: Deep Learning
Keywords: Science of deep learning, compositionality, diffusion models, probabilistic graphical models, sample complexity, generalization
Submission Number: 16314