Do Diffusion Models Learn Semantically Meaningful and Efficient Representations?

ICLR 2024 Workshop ME-FoMo Submission 55 Authors

Published: 04 Mar 2024, Last Modified: 30 Apr 2024 · ME-FoMo 2024 Poster · CC BY 4.0
Keywords: diffusion models, compositional generalization, training dynamics, representation learning, phase transitions
TL;DR: We conduct a toy model study to demonstrate that simple vanilla-flavored conditional diffusion models learn semantically meaningful but inefficient representations.
Abstract: Diffusion models have shown the ability to generate images with unconventional compositions, such as astronauts riding horses on the moon, indicating compositional generalization. However, the vast size and complex nature of realistic datasets make it challenging to quantitatively probe diffusion models' ability to compositionally generalize. Here, we consider a highly reduced setting to examine whether diffusion models learn semantically meaningful and fully factorized representations of composable features. We perform controlled experiments on conditional DDPMs learning to generate 2D spherical Gaussian bumps centered at specified $x$- and $y$-positions. En route to successfully learning semantically meaningful representations, we observe three distinct learning phases: (phase A) no latent structure, (phase B) a 2D manifold of disordered states, and (phase C) a 2D ordered manifold, each with distinct generation behaviors and failure modes. Furthermore, we show that even under imbalanced datasets, where features ($x$- versus $y$-positions) are represented with skewed frequencies, the learning processes for $x$ and $y$ are coupled rather than factorized, potentially indicating that the model does not learn a fully factorized, and hence efficient, representation.
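For intuition, the toy setting described in the abstract (images of a single 2D isotropic Gaussian bump whose center serves as the conditioning variable) might be constructed along the following lines. This is a minimal sketch, not the authors' code: the image size, bump width `sigma`, and the helper names `gaussian_bump_image` and `make_dataset` are illustrative assumptions.

```python
import numpy as np

def gaussian_bump_image(x0, y0, size=32, sigma=1.0):
    """Render one 2D isotropic ("spherical") Gaussian bump centered at (x0, y0) in pixel coordinates."""
    ys, xs = np.mgrid[0:size, 0:size]
    img = np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2.0 * sigma ** 2))
    return img.astype(np.float32)

def make_dataset(n_samples, size=32, sigma=1.0, rng=None):
    """Sample (image, condition) pairs; the condition is the bump center (x0, y0)."""
    rng = np.random.default_rng() if rng is None else rng
    centers = rng.uniform(low=2.0, high=size - 2.0, size=(n_samples, 2))
    images = np.stack([gaussian_bump_image(x0, y0, size, sigma) for x0, y0 in centers])
    return images, centers.astype(np.float32)

# Example: 1,000 (image, center) pairs for training a conditional DDPM on this toy task.
images, conditions = make_dataset(1000)
print(images.shape, conditions.shape)  # (1000, 32, 32) (1000, 2)
```

A skewed-frequency variant of the imbalanced-dataset experiment could be obtained by sampling the $x$- and $y$-coordinates from grids of different resolutions, though the exact sampling scheme used in the paper is not specified in this abstract.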
Submission Number: 55