Keywords: Text-to-image generation, global-local composition, diffusion models
TL;DR: a novel framework to compose global contexts and local details using diffusion models
Abstract: Diffusion models have demonstrated their capability to synthesize high-quality and diverse images from textual prompts. However, simultaneous control over both global contexts (e.g., object layouts and interactions) and local details (e.g., colors and emotions) remains a significant challenge. The models often fail to understand complex descriptions involving multiple objects, applying specified visual attributes to the wrong targets or omitting them entirely. This paper presents MultiLayerDiffusion, a novel framework that allows simultaneous control over global contexts and local details in text-to-image generation without requiring training or fine-tuning. It assigns multiple global and local prompts to corresponding layers and composes them to generate images using pre-trained diffusion models. Our framework enables complex global-local compositions, decomposing intricate prompts into manageable concepts and controlling object details while preserving global contexts. We demonstrate that MultiLayerDiffusion effectively generates complex images that adhere to both user-provided object interactions and object details. We also show its effectiveness not only in image generation but also in image editing.
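For illustration, below is a minimal, hypothetical sketch of the general idea the abstract describes: composing multiple (global and local) prompt layers with a pre-trained diffusion model. It is not the paper's MultiLayerDiffusion algorithm; the weighted combination of per-layer noise predictions, the layer weights, and the example prompts are all assumptions made for this sketch, built on the Hugging Face diffusers Stable Diffusion pipeline.

```python
# Hypothetical sketch only: this is NOT the paper's MultiLayerDiffusion
# algorithm. It illustrates the general idea of composing several
# (global / local) prompt "layers" with a pre-trained diffusion model by
# taking a weighted combination of their noise predictions at each step.
# Assumes a CUDA GPU and the Hugging Face `diffusers` library.
import torch
from diffusers import StableDiffusionPipeline

device = "cuda"
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to(device)


@torch.no_grad()
def encode(prompt: str) -> torch.Tensor:
    """Encode a text prompt into CLIP embeddings for the UNet."""
    tokens = pipe.tokenizer(
        prompt,
        padding="max_length",
        max_length=pipe.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids.to(device)
    return pipe.text_encoder(tokens)[0]


# One global-context layer plus local-detail layers; the weights are
# arbitrary choices for this sketch, not values from the paper.
layers = [
    ("a cat and a dog sitting on a sofa", 1.0),        # global context
    ("a black cat with green eyes", 0.5),              # local detail
    ("a golden retriever wearing a red scarf", 0.5),   # local detail
]
embeds = [(encode(p), w) for p, w in layers]
uncond = encode("")
total_w = sum(w for _, w in layers)

num_steps, guidance = 50, 7.5
pipe.scheduler.set_timesteps(num_steps, device=device)
latents = torch.randn(1, 4, 64, 64, device=device, dtype=torch.float16)
latents = latents * pipe.scheduler.init_noise_sigma

with torch.no_grad():
    for t in pipe.scheduler.timesteps:
        latent_in = pipe.scheduler.scale_model_input(latents, t)
        noise_uncond = pipe.unet(latent_in, t, encoder_hidden_states=uncond).sample
        # Weighted combination of the per-layer conditional noise predictions.
        composed = torch.zeros_like(noise_uncond)
        for emb, w in embeds:
            noise_cond = pipe.unet(latent_in, t, encoder_hidden_states=emb).sample
            composed = composed + (w / total_w) * noise_cond
        # Standard classifier-free guidance around the composed prediction.
        noise_pred = noise_uncond + guidance * (composed - noise_uncond)
        latents = pipe.scheduler.step(noise_pred, t, latents).prev_sample

    # Decode the final latents to an image tensor in [-1, 1].
    image = pipe.vae.decode(latents / pipe.vae.config.scaling_factor).sample
```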
Supplementary Material: zip
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4039