Abstract: Text-to-image synthesis aims to generate high-quality, realistic images conditioned on text descriptions. The central challenge of this task lies in deeply and seamlessly integrating image and text information. In this paper, we propose a deep multimodal fusion generative adversarial network (DMF-GAN) that enables effective semantic interaction for fine-grained text-to-image generation. Specifically, through a novel recurrent semantic fusion network, DMF-GAN consistently controls the global assignment of text information across otherwise isolated fusion blocks. With the assistance of a multi-head attention module, DMF-GAN models word information from different perspectives and further improves semantic consistency. In addition, a word-level discriminator is proposed to provide the generator with fine-grained feedback related to each word. Compared with current state-of-the-art methods, our proposed DMF-GAN efficiently synthesizes realistic, text-aligned images and achieves better performance on challenging benchmarks.
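To make the fusion idea in the abstract concrete, the following is a minimal PyTorch sketch of a text-image fusion block: word features are aggregated with multi-head attention and used to modulate image features. All module and parameter names (FusionBlock, n_heads, to_gamma, etc.) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a text-conditioned fusion block, assuming affine
# modulation of image features by an attended text context.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, img_channels: int, word_dim: int, n_heads: int = 4):
        super().__init__()
        # Multi-head attention lets the image query word features from
        # several representation subspaces ("different perspectives").
        self.attn = nn.MultiheadAttention(word_dim, n_heads, batch_first=True)
        self.img_to_query = nn.Linear(img_channels, word_dim)
        # Predict per-channel scale and shift from the attended text context.
        self.to_gamma = nn.Linear(word_dim, img_channels)
        self.to_beta = nn.Linear(word_dim, img_channels)

    def forward(self, img_feat: torch.Tensor, word_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W); word_feat: (B, L, word_dim)
        b, c, h, w = img_feat.shape
        query = self.img_to_query(img_feat.mean(dim=(2, 3))).unsqueeze(1)  # (B, 1, word_dim)
        context, _ = self.attn(query, word_feat, word_feat)                # (B, 1, word_dim)
        gamma = self.to_gamma(context).view(b, c, 1, 1)
        beta = self.to_beta(context).view(b, c, 1, 1)
        # Affine modulation of the image feature map by the text context.
        return img_feat * (1 + gamma) + beta

# Usage: fuse a batch of 8x8 image features with 18 word embeddings.
block = FusionBlock(img_channels=64, word_dim=256)
out = block(torch.randn(2, 64, 8, 8), torch.randn(2, 18, 256))
print(out.shape)  # torch.Size([2, 64, 8, 8])
```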