Abstract: Text-to-image synthesis aims to generate high-quality, realistic images conditioned on text descriptions. The central challenge of this task lies in deeply and seamlessly integrating image and text information. In this paper, we propose a deep multimodal fusion generative adversarial network (DMF-GAN) that enables effective semantic interaction for fine-grained text-to-image generation. Specifically, through a novel recurrent semantic fusion network, DMF-GAN consistently controls the global assignment of text information across otherwise isolated fusion blocks. With the assistance of a multi-head attention module, DMF-GAN models word information from different perspectives and further improves semantic consistency. In addition, a word-level discriminator is proposed to provide the generator with fine-grained feedback related to each word. Compared with current state-of-the-art methods, our proposed DMF-GAN efficiently synthesizes realistic, text-aligned images and achieves better performance on challenging benchmarks.
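To make the fusion idea in the abstract concrete, the following is a minimal PyTorch sketch of a text-image fusion block: word features are aggregated with multi-head attention and used to modulate image features. All module and parameter names (FusionBlock, n_heads, to_gamma, etc.) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a text-conditioned fusion block, assuming affine
# modulation of image features by an attended text context.
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    def __init__(self, img_channels: int, word_dim: int, n_heads: int = 4):
        super().__init__()
        # Multi-head attention lets the image query word features from
        # several representation subspaces ("different perspectives").
        self.attn = nn.MultiheadAttention(word_dim, n_heads, batch_first=True)
        self.img_to_query = nn.Linear(img_channels, word_dim)
        # Predict per-channel scale and shift from the attended text context.
        self.to_gamma = nn.Linear(word_dim, img_channels)
        self.to_beta = nn.Linear(word_dim, img_channels)

    def forward(self, img_feat: torch.Tensor, word_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W); word_feat: (B, L, word_dim)
        b, c, h, w = img_feat.shape
        query = self.img_to_query(img_feat.mean(dim=(2, 3))).unsqueeze(1)  # (B, 1, word_dim)
        context, _ = self.attn(query, word_feat, word_feat)                # (B, 1, word_dim)
        gamma = self.to_gamma(context).view(b, c, 1, 1)
        beta = self.to_beta(context).view(b, c, 1, 1)
        # Affine modulation of the image feature map by the text context.
        return img_feat * (1 + gamma) + beta

# Usage: fuse a batch of 8x8 image features with 18 word embeddings.
block = FusionBlock(img_channels=64, word_dim=256)
out = block(torch.randn(2, 64, 8, 8), torch.randn(2, 18, 256))
print(out.shape)  # torch.Size([2, 64, 8, 8])
```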