Generative Adversarial Network with Hierarchical Semantic Prompt-Constrained CLIP for High-Quality Text-to-Image Synthesis

18 Sept 2023 (modified: 11 Feb 2024) · Submitted to ICLR 2024
Primary Area: generative models
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: Text-To-Image, Hierarchical Semantic Guidance, CLIP, Prompt Constraint, GAN, Hard Mining
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Synthesizing high-quality images from text efficiently, controllably, and with strong semantic relevance remains a challenging task. Combining generative adversarial networks with CLIP to improve the quality of synthesized images has revitalized GANs in the field of image generation: compared with diffusion models, GANs offer faster generation, fewer training resources and parameters, and more controllable results. However, current methods that combine CLIP and GANs are relatively coarse, mostly using CLIP as a text encoder or feature bridge; they do not fully exploit CLIP's semantic alignment ability, ignore the structural and hierarchical nature of semantic features, and consequently produce images with poor semantic consistency. To address these problems, we propose HSPC-GAN, a method that constructs structural semantic prompts and uses them to hierarchically guide CLIP in adjusting visual features, yielding high-quality images with controllable semantic consistency. HSPC-GAN extracts semantic concepts through part-of-speech analysis, constructs a prompt generator and a prompt adaptor to produce learnable hierarchical semantic prompts, and uses these prompts to selectively guide CLIP adapters to adjust image features, improving semantic consistency between synthesized images and their conditioning texts. In addition, we introduce hard negative mining into the construction of the discriminator loss for the first time, improving the discriminator's ability to distinguish mismatched samples and reducing the sensitivity of training results to batch size and the number of epochs. Extensive experiments demonstrate the effectiveness of our method, which quickly synthesizes high-quality, semantically consistent images and achieves state-of-the-art results on public datasets.
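The part-of-speech step mentioned in the abstract can be illustrated with a short sketch. The submission does not specify which tagger it uses or how concepts are grouped, so the NLTK tagger and the object/attribute/relation hierarchy below are assumptions for illustration only:

```python
# Minimal sketch of POS-based semantic concept extraction. The tagger
# (NLTK) and the three-level grouping are illustrative assumptions,
# not the paper's stated pipeline.
import nltk

# One-time downloads (uncomment on first run):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def extract_concepts(caption: str) -> dict:
    """Split a caption into hierarchical concept groups by POS tag."""
    tagged = nltk.pos_tag(nltk.word_tokenize(caption))
    concepts = {"objects": [], "attributes": [], "relations": []}
    for word, tag in tagged:
        if tag.startswith("NN"):            # nouns -> object-level concepts
            concepts["objects"].append(word)
        elif tag.startswith("JJ"):          # adjectives -> attribute-level
            concepts["attributes"].append(word)
        elif tag.startswith(("VB", "IN")):  # verbs/prepositions -> relations
            concepts["relations"].append(word)
    return concepts

print(extract_concepts("a small yellow bird perched on a green branch"))
# {'objects': ['bird', 'branch'], 'attributes': ['small', 'yellow', 'green'], ...}
```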
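The hard-negative-mining term in the discriminator loss can likewise be sketched. The paper's exact formulation is not reproduced here: the hinge form, the in-batch selection of the hardest mismatched caption, and names such as `disc` are hypothetical, and the usual term for generated (fake) images is omitted for brevity:

```python
# Illustrative sketch (not the authors' exact loss) of hard negative mining
# for a text-conditional discriminator: for each real image, the mismatched
# caption that the discriminator scores highest in the batch is used as the
# hard negative.
import torch
import torch.nn.functional as F

def d_loss_with_hard_negatives(disc, images, text_emb):
    """Hinge discriminator loss with in-batch hard negative captions."""
    b = images.size(0)
    # Score every image against b cyclic shifts of the caption batch; shift 0
    # is the matched pairing, shifts 1..b-1 are mismatched pairings. -> (B, B)
    logits = torch.stack(
        [disc(images, text_emb.roll(shifts=k, dims=0)) for k in range(b)],
        dim=1,
    )
    real_logits = logits[:, 0]                       # matched image-text pairs
    mask = torch.zeros_like(logits)
    mask[:, 0] = float("-inf")                       # exclude the matched column
    hard_neg_logits, _ = (logits + mask).max(dim=1)  # hardest mismatch per image

    loss_real = F.relu(1.0 - real_logits).mean()          # push matched scores up
    loss_hard_neg = F.relu(1.0 + hard_neg_logits).mean()  # push mismatches down
    return loss_real + loss_hard_neg
```

Mining the single hardest mismatch per image, rather than averaging over all mismatched pairs, is what plausibly reduces the dependence on large batch sizes noted in the abstract: even a small batch supplies one maximally confusing negative per sample.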
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1407