Multi-scale dual-modal generative adversarial networks for text-to-image synthesis

Published: 01 Jan 2023, Last Modified: 11 May 2023. Multimedia Tools and Applications, 2023.
Abstract: Generating images from text descriptions is a challenging task due to the natural gap between the textual and visual modalities. Despite the promising results of existing methods, they suffer from two limitations: (1) they focus on image semantic information but fail to fully exploit texture information; (2) they model the correlation between words and the image only at a fixed scale, which reduces the diversity and discriminability of the network's representations. To address these issues, we propose a Multi-scale Dual-modal Generative Adversarial Network (MD-GAN). The core components of MD-GAN are the dual-modal modulation attention (DMA) and the multi-scale consistency discriminator (MCD). The DMA consists of two blocks: a textual guiding module that captures the correlation between images and text descriptions to rectify the image's semantic content, and a channel sampling module that adjusts image texture by selectively aggregating channel-wise information over the spatial dimensions. In addition, the MCD models the correlation between the text and image regions of various sizes, enhancing the semantic consistency between text and images. Extensive experiments on the CUB and MS-COCO datasets show the superiority of MD-GAN over state-of-the-art methods.
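
The abstract gives no implementation details, but the DMA description suggests a familiar pair of attention mechanisms. Below is a minimal PyTorch sketch, assuming a cross-attention formulation for the textual guiding module and a squeeze-and-excitation-style gate for the channel sampling module; all module names, tensor shapes, and hyper-parameters are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextualGuidingModule(nn.Module):
    """Cross-attention from image regions to word features, used to
    rectify image semantic content (assumed formulation)."""
    def __init__(self, img_dim, word_dim):
        super().__init__()
        self.proj = nn.Linear(word_dim, img_dim)  # map words into image feature space

    def forward(self, img_feat, word_feat):
        # img_feat: (B, C, H, W); word_feat: (B, L, word_dim)
        B, C, H, W = img_feat.shape
        regions = img_feat.flatten(2).transpose(1, 2)     # (B, HW, C)
        words = self.proj(word_feat)                      # (B, L, C)
        attn = torch.bmm(regions, words.transpose(1, 2))  # (B, HW, L)
        attn = F.softmax(attn / C ** 0.5, dim=-1)         # word weights per region
        context = torch.bmm(attn, words)                  # (B, HW, C)
        return (regions + context).transpose(1, 2).reshape(B, C, H, W)

class ChannelSamplingModule(nn.Module):
    """Channel-wise gating from spatially pooled statistics, used to
    adjust texture (assumed squeeze-and-excitation-style design)."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, img_feat):
        gate = self.fc(img_feat.mean(dim=(2, 3)))  # (B, C) pooled over H, W
        return img_feat * gate[:, :, None, None]   # reweight channels
```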
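
Likewise, the MCD is described only as correlating text with image regions of various sizes. One plausible reading is a discriminator that emits a text-conditioned patch logit map at each feature scale; the sketch below assumes that design, with all layer choices (channel widths, kernel sizes, sentence-embedding dimension) invented for illustration.

```python
import torch
import torch.nn as nn

class MultiScaleConsistencyDiscriminator(nn.Module):
    """Scores text-image consistency at several feature-map scales
    (an assumed realisation of the MCD; the paper's design may differ)."""
    def __init__(self, sent_dim=256, chans=(64, 128, 256, 512)):
        super().__init__()
        blocks, heads, in_ch = [], [], 3
        for c in chans:
            blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, c, 4, 2, 1),
                nn.LeakyReLU(0.2, inplace=True)))
            # one text-conditioned patch head per scale
            heads.append(nn.Conv2d(c + sent_dim, 1, 3, 1, 1))
            in_ch = c
        self.blocks = nn.ModuleList(blocks)
        self.heads = nn.ModuleList(heads)

    def forward(self, image, sent_emb):
        # image: (B, 3, H, W); sent_emb: (B, sent_dim)
        feat, logits = image, []
        for block, head in zip(self.blocks, self.heads):
            feat = block(feat)
            # broadcast the sentence embedding over the current spatial grid
            s = sent_emb[:, :, None, None].expand(-1, -1, *feat.shape[2:])
            logits.append(head(torch.cat([feat, s], dim=1)))  # (B, 1, h, w)
        return logits  # per-scale consistency maps
```

Under this assumed design, each per-scale logit map would contribute a patch-level adversarial term, and the terms would be summed or averaged into the overall discriminator loss so that semantic consistency is enforced at every region size.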