Abstract: Despite the remarkable success of diffusion models (DMs) in data generation, they exhibit specific failure cases with unsatisfactory outputs. We focus on one such limitation: the ability of DMs to learn hidden rules between image features. Specifically, for image data with dependent features $\mathbf{x}$ and $\mathbf{y}$ (e.g., the height of the sun, $\mathbf{x}$, and the length of its shadow, $\mathbf{y}$), we investigate whether DMs can accurately capture the inter-feature rule $p(\mathbf{y}|\mathbf{x})$. Empirical evaluations on mainstream DMs (e.g., Stable Diffusion 3.5) reveal consistent failures, such as inconsistent lighting-shadow relationships and mismatched object-mirror reflections. Motivated by these findings, we design four synthetic tasks with strongly correlated features to assess DMs' rule-learning abilities. Extensive experiments show that while DMs can identify coarse-grained rules, they struggle with fine-grained ones. Our theoretical analysis demonstrates that DMs trained via denoising score matching (DSM) exhibit constant errors in learning hidden rules, because the DSM objective is incompatible with rule conformity. To mitigate this, we apply a common technique: incorporating additional classifier guidance during sampling, which achieves limited improvements. Our analysis reveals that the subtle signals of fine-grained rules are difficult for the classifier to capture, providing insights for future exploration.
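For readers unfamiliar with the objective referenced above, the standard DSM loss can be written as follows. This is generic notation for illustration; the paper's exact formulation may differ:

$$\mathcal{L}_{\mathrm{DSM}}(\theta)=\mathbb{E}_{t,\,\mathbf{x}_0,\,\boldsymbol{\epsilon}}\!\left[\lambda(t)\,\big\|\mathbf{s}_\theta(\mathbf{x}_t,t)-\nabla_{\mathbf{x}_t}\log p_t(\mathbf{x}_t\mid\mathbf{x}_0)\big\|^2\right],\qquad \mathbf{x}_t=\alpha_t\mathbf{x}_0+\sigma_t\boldsymbol{\epsilon}.$$

Note that this loss rewards per-sample score reconstruction and contains no term that explicitly penalizes violations of the inter-feature rule $p(\mathbf{y}|\mathbf{x})$, which is the incompatibility the abstract points to.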
Lay Summary: Our study reveals a new failure mode of diffusion models (DMs): a previously underexplored and unsolved challenge!
In our work, we ask: ***Can image generative models like diffusion models truly capture the underlying rules embedded in image data?*** For instance, can DMs accurately learn fine-grained rules within an image, such as how the height of the sun influences the length of a shadow? This question is crucial for using DMs to faithfully reconstruct the physical world.
Through extensive experiments on both synthetic tasks and real-world datasets, our findings provide a clear answer: DMs can learn coarse rules (e.g., the sun and the shadow should appear on opposite sides of an object), but they struggle to capture fine-grained rules (e.g., the precise geometric constraints between the sun's height and the shadow's length). Our theoretical analysis suggests that the root cause is a mismatch between the optimization objective of DMs and the underlying rules embedded in the data, which leads to persistent constant errors in rule learning. Worse still, addressing this issue with conventional techniques, such as introducing guidance during sampling, yields only limited improvement. A key bottleneck is that fine-grained rules typically manifest as weak signals within the data, making them difficult to capture and leverage for effective guidance.
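To make the "guidance during sampling" technique concrete, below is a minimal sketch of one classifier-guided reverse-diffusion step in PyTorch. The interfaces `score_model` and `classifier`, and the simplified single-step update, are illustrative assumptions, not the authors' implementation:

```python
import torch

def classifier_guided_step(x_t, t, score_model, classifier, target,
                           guidance_scale=1.0, beta_t=0.01):
    """One simplified DDPM-style reverse step with classifier guidance.

    Hypothetical interfaces (assumptions, not the paper's code):
      - score_model(x_t, t): returns the learned score s_theta(x_t, t).
      - classifier(x_t, t): returns per-class logits for a rule label
        (e.g., a discretized sun-height/shadow-length relation).
      - target: LongTensor of shape (batch,) with the desired rule labels.
    """
    # Classifier gradient: nabla_x log p(target | x_t), via autograd.
    x_in = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
    selected = log_probs[torch.arange(x_in.shape[0]), target].sum()
    class_grad = torch.autograd.grad(selected, x_in)[0]

    # Guided score: s_theta(x_t, t) + w * nabla_x log p(y | x_t).
    guided_score = score_model(x_t, t) + guidance_scale * class_grad

    # Ancestral update with a constant beta_t (illustration only).
    mean = (x_t + beta_t * guided_score) / (1.0 - beta_t) ** 0.5
    return mean + beta_t ** 0.5 * torch.randn_like(x_t)
```

The key quantity is the classifier gradient added to the learned score. As noted above, when a fine-grained rule manifests only as a weak signal in the data, this gradient is correspondingly weak or noisy, which limits how much the guidance can improve rule conformity.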
Primary Area: Theory->Deep Learning
Keywords: Diffusion Model, Deep Generative Model
Submission Number: 15070