Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Text-to-image Synthesis, Diffusion model
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Recent advances in text-to-image synthesis benefit greatly from large-scale vision-language models such as CLIP. Although existing methods can produce high-quality and creative images, they often struggle to capture the details of the text prompt, especially when the text is lengthy. We reveal that this issue is partially caused by imperfect text-image matching with CLIP, where fine-grained semantics may be obscured by the dominant ones. This work presents a new diffusion-based method that favors fine-grained synthesis with semantic refinement. Concretely, instead of synthesizing from the entire descriptive sentence as the prompt, users can emphasize specific words of interest. For this purpose, we incorporate a semantic-induced gradient as a reference input at each denoising step to help the model understand the selected sub-concept. We find that our framework supports combining multiple semantics by directly adding up their corresponding gradients. Extensive results on various datasets suggest that our approach outperforms existing text-to-image generation methods by synthesizing semantic details with finer granularity.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3059
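
Illustrative sketch (not the authors' implementation): the abstract states that a semantic-induced gradient is supplied as a reference input at each denoising step, and that multiple semantics can be combined by adding their gradients. The toy below only illustrates that additive structure. Everything in it is an assumption made for illustration: the linear `encoder` stands in for a CLIP-like image encoder, `score` is a plain inner-product matching score, and `toy_denoise_step` is a hypothetical stand-in for a learned denoiser that receives the gradient as an extra input.

```python
import numpy as np

def score(x, concept, encoder):
    """Toy text-image matching score: <encoder @ x, concept> (assumed, not CLIP)."""
    return float((encoder @ x) @ concept)

def semantic_gradient(x, concept, encoder):
    """Gradient of the toy score with respect to x.

    For a linear encoder this is encoder.T @ concept and does not depend on x;
    a real encoder would require automatic differentiation.
    """
    return encoder.T @ concept

def combined_gradient(x, concepts, encoder):
    """Combine several sub-concepts by adding up their gradients."""
    return np.sum([semantic_gradient(x, c, encoder) for c in concepts], axis=0)

def toy_denoise_step(x, grad, shrink=0.9, scale=0.1):
    """Hypothetical denoiser: shrinks the sample and uses grad as a reference input."""
    return shrink * x + scale * grad

def sample(concepts, encoder, steps=100, seed=0):
    """Run the toy denoising loop, recomputing the combined gradient at every step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(encoder.shape[1])  # start from Gaussian noise
    for _ in range(steps):
        g = combined_gradient(x, concepts, encoder)
        x = toy_denoise_step(x, g)
    return x
```

With an identity encoder and two orthogonal concept vectors, the loop drives the sample toward the sum of the two concepts, so both matching scores end up high at once. This mirrors only the additive-combination idea; the actual model, gradient definition, and guidance mechanism are described in the paper itself.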