LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts

Binbin Yang, Yi Luo, Ziliang Chen, Guangrun Wang, Xiaodan Liang, Liang Lin

02 Nov 2023 | OpenReview Archive Direct Upload | Readers: Everyone

Abstract: Thanks to the rapid development of diffusion models, unprecedented progress has been witnessed in image synthesis. Prior works mostly rely on pre-trained linguistic models, but text is often too abstract to properly specify all the spatial properties of an image, e.g., the layout configuration of a scene, leading to sub-optimal results in complex scene generation. In this paper, we achieve accurate complex scene generation by proposing a semantically controllable Layout-AWare diffusion model, termed LAW-Diffusion. Distinct from previous Layout-to-Image generation (L2I) methods that primarily explore category-aware relationships, LAW-Diffusion introduces a spatial dependency parser to encode the location-aware semantic coherence across objects as a layout embedding and produces a scene with perceptually harmonious object styles and contextual relations. To be specific, we delicately instantiate each object’s regional semantics as an object region map and leverage a location-aware cross-object attention module to capture the spatial dependencies among those disentangled representations. We further propose an adaptive guidance schedule for our layout guidance to mitigate the trade-off between the regional semantic alignment and the texture fidelity of generated objects. Moreover, LAW-Diffusion allows for instance reconfiguration while maintaining the other regions in a synthesized image by introducing a layout-aware latent grafting mechanism to recompose its local regional semantics. To better verify the plausibility of generated scenes, we propose a new evaluation metric for the L2I task, dubbed Scene Relation Score (SRS), to measure how the images preserve the rational and harmonious relations among contextual objects. Comprehensive experiments on COCO-Stuff and Visual-Genome demonstrate that our LAW-Diffusion yields the state-of-the-art generative performance, especially with coherent object relations.
