Keywords: diffusion models, spatial-awareness, compositionality
TL;DR: Improving compositional awareness in diffusion models by training on a novel text-image dataset
Abstract: Diffusion models have demonstrated powerful generative capabilities in recent years, achieving impressive performance on visual tasks such as text-guided image generation, inpainting, denoising, and photorealistic high-resolution image synthesis. However, even state-of-the-art diffusion models like DALL-E 2 still struggle with basic visual reasoning: for instance, they often incorrectly render compositional relationships, object counts, and negations in their text-guided generations. In this paper, we compare GLIDE and Stable Diffusion and make the following contributions: (1) we probe visual reasoning failure modes during diffusion generation, (2) we create a text-image dataset (GQA-Captions) from scene graphs to improve compositionality in text-to-image generation, and (3) we assess, both quantitatively and qualitatively (via human evaluation), whether finetuning on spatially focused datasets improves the compositional correctness of diffusion model generations. We also discuss limitations of existing quantitative metrics for assessing spatial reasoning in diffusion model generations. Our evaluations suggest that finetuning on spatially robust text-image data positively correlates with compositional correctness in diffusion generations.
Submission Type: non-archival
Presentation Type: online
Presenters: Jason Lin, Maya Srikanth