SPADE: Training-Free Improvement of Spatial Fidelity in Text-to-Image Generation

18 Sept 2023 (modified: 25 Mar 2024), ICLR 2024 Conference Withdrawn Submission
Keywords: T2I, Image Synthesis, Synthetic Dataset
TL;DR: Improving Spatial Understanding of Text to Image Models using Spatially Accurate Reference Images in a Training-free manner
Abstract: Text-to-Image (T2I) generation models have steadily improved in their ability to generate photo-realistic images. However, they have been shown to struggle with reasoning-intensive textual instructions, particularly when generating accurate spatial relationships between objects. In this work, we address this shortcoming by leveraging spatially accurate images (LSAI) as grounding references to guide diffusion-based T2I models. Given an input prompt containing a spatial phrase, our method symbolically creates a corresponding synthetic image that accurately represents the spatial relationship articulated in the prompt. We then use the created image alongside the text prompt, in a training-free manner, to condition image synthesis models to generate spatially coherent images. To facilitate our LSAI method, we create SPADE, a large database of 190k text-image pairs in which each image is deterministically generated with open-source 3D rendering tools; the database covers a diverse set of 80 MS-COCO objects. Variation in the SPADE images is introduced through object and background manipulation as well as GPT-4 guided layout arrangement. We evaluate SPADE as T2I guidance on Stable Diffusion and ControlNet, and find that our LSAI method substantially improves upon existing methods on the VISOR benchmark. Through extensive ablations, we analyze LSAI with respect to multiple facets of SPADE, and we perform human studies to demonstrate the effectiveness of our method on prompts that contain multiple relationships and out-of-distribution objects. Finally, we present our SPADE Generator to the research community as an extendable framework.
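To make the "symbolically creating a corresponding synthetic image" step more concrete, the following is a minimal sketch of how a spatial phrase in a prompt could be turned into a two-object layout. It is not the authors' SPADE Generator, which renders images with 3D tools; the `RELATIONS` table, the function names `parse_prompt` and `layout`, and the `size` and `gap` parameters are all assumptions made for illustration.

```python
import re
from typing import List, Tuple

# Hypothetical table: spatial phrase -> (dx, dy) direction of object A relative
# to object B, in image coordinates (x grows rightward, y grows downward).
RELATIONS = {
    "to the left of": (-1, 0),
    "to the right of": (1, 0),
    "above": (0, -1),
    "below": (0, 1),
}

# A box is (x_min, y_min, x_max, y_max), normalized to the unit square.
Box = Tuple[float, float, float, float]


def parse_prompt(prompt: str) -> Tuple[str, str, str]:
    """Split e.g. 'a dog to the left of a cat' into ('dog', 'to the left of', 'cat')."""
    for phrase in RELATIONS:
        pattern = rf"(?:an?|the)\s+(.+?)\s+{phrase}\s+(?:an?|the)\s+(.+)"
        match = re.fullmatch(pattern, prompt.strip(), flags=re.IGNORECASE)
        if match:
            return match.group(1), phrase, match.group(2)
    raise ValueError(f"no supported spatial phrase found in: {prompt!r}")


def layout(prompt: str, size: float = 0.3, gap: float = 0.2) -> List[Tuple[str, Box]]:
    """Place two square boxes symmetrically around the image center."""
    obj_a, phrase, obj_b = parse_prompt(prompt)
    dx, dy = RELATIONS[phrase]
    offset = (size + gap) / 2  # distance of each box center from the image center
    half = size / 2

    def box(cx: float, cy: float) -> Box:
        return (cx - half, cy - half, cx + half, cy + half)

    return [
        (obj_a, box(0.5 + dx * offset, 0.5 + dy * offset)),
        (obj_b, box(0.5 - dx * offset, 0.5 - dy * offset)),
    ]


if __name__ == "__main__":
    for name, coords in layout("a dog to the left of a cat"):
        print(name, coords)
```

In a pipeline like the one the abstract describes, a layout of this kind would be rendered into a reference image and passed, together with the text prompt, to an image-conditioned diffusion model; that rendering and conditioning step is outside the scope of this sketch.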
Supplementary Material: zip
Primary Area: generative models
Submission Number: 1091