Multi-Sentence Complementarily Generation for Text-to-Image Synthesis

Published: 01 Jan 2024 · Last Modified: 08 Apr 2025 · IEEE Trans. Multim. 2024 · CC BY-SA 4.0
Abstract: Generating realistic images from text descriptions remains a challenging problem in computer vision. Existing multi-stage generation methods can produce high-resolution images, but they typically synthesize an image from a single sentence, from which it is difficult to extract adequate semantic features; as a result, the generated images diverge substantially from the ground-truth images. In this article, we propose a Multi-Sentence Complementary Generative Adversarial Network (MSCGAN), which generates more accurate images by fusing the semantics shared across different sentences while preserving the semantics unique to each. More specifically, the BERT model is employed to extract semantic features, and a multi-semantic fusion module (MSFM) is designed to fuse the semantic features of different sentences. In addition, a pre-trained cross-modal contrast similarity model (CCSM) is developed to impose a fine-grained loss on the generated images. Moreover, a multi-sentence joint discriminator is designed to ensure that the generated images match all input sentences. Experiments and ablation studies on the CUB and MS-COCO datasets demonstrate that the proposed method significantly outperforms state-of-the-art methods.
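The abstract does not include implementation details, so the following PyTorch sketch is only an illustrative reading of the described pipeline, not the authors' code: the class name MultiSemanticFusion, the attention-based fusion, the 768-dimensional BERT embeddings, and the InfoNCE-style stand-in for the CCSM fine-grained loss are all assumptions made for exposition.

```python
# Illustrative sketch only: module names, dimensions, and the fusion/loss choices
# below are assumptions for exposition, not the paper's published implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiSemanticFusion(nn.Module):
    """Hypothetical multi-semantic fusion: attends over several sentence
    embeddings of the same image so that shared semantics are reinforced
    while sentence-specific cues are preserved."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, sent_emb: torch.Tensor) -> torch.Tensor:
        # sent_emb: (batch, n_sentences, dim), e.g. BERT [CLS] vectors
        fused, _ = self.attn(sent_emb, sent_emb, sent_emb)  # cross-sentence attention
        fused = fused + sent_emb                            # residual keeps unique semantics
        return self.proj(fused.mean(dim=1))                 # (batch, dim) conditioning vector


def contrastive_similarity_loss(img_feat, txt_feat, temperature: float = 0.07):
    """InfoNCE-style cross-modal loss, used here as a generic stand-in for the
    paper's CCSM fine-grained loss: matching image/text pairs are pulled together."""
    img = F.normalize(img_feat, dim=-1)
    txt = F.normalize(txt_feat, dim=-1)
    logits = img @ txt.t() / temperature                    # (batch, batch) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    B, N, D = 4, 3, 768              # batch size, sentences per image, embedding dim
    sentences = torch.randn(B, N, D)  # placeholder for BERT sentence embeddings
    fusion = MultiSemanticFusion(D)
    cond = fusion(sentences)          # conditioning vector fed to the generator
    fake_img_feat = torch.randn(B, D) # placeholder for generated-image features
    loss = contrastive_similarity_loss(fake_img_feat, cond)
    print(cond.shape, loss.item())
```

In this reading, the fused vector conditions the generator while the contrastive term supplies the fine-grained supervision the abstract attributes to CCSM; the multi-sentence joint discriminator described in the abstract is not sketched here.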