Multimodal Fusion Generative Adversarial Network for Image Synthesis

Published: 01 Jan 2024, Last Modified: 08 Apr 2025, IEEE Signal Process. Lett. 2024, CC BY-SA 4.0
Abstract: Text-to-image synthesis has advanced significantly; however, a crucial limitation persists: textual descriptions often neglect essential background details, leading to blurred backgrounds and diminished image quality. To address this, we propose a multimodal fusion framework that integrates information from both the text and image modalities. Our approach introduces a background mask to compensate for missing textual descriptions of background elements. Additionally, we employ an adaptive channel attention mechanism to effectively exploit the fused features, dynamically accentuating informative feature maps. Furthermore, we introduce a novel fusion conditional loss that ensures generated images not only align with textual descriptions but also exhibit realistic backgrounds. Experimental evaluations on the Caltech-UCSD Birds 200 (CUB) and COCO datasets demonstrate the efficacy of our approach, which achieves a Fréchet Inception Distance (FID) of 15.38 on the CUB dataset, surpassing several state-of-the-art approaches.
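To make the abstract's components concrete, the following is a minimal sketch (not the authors' released code) of how a background mask and an adaptive channel attention block over fused text/image features might look. The class name `AdaptiveChannelAttention`, the squeeze-and-excitation style gating, and the masked fusion rule are assumptions for illustration; the paper's exact mechanism and fusion conditional loss may differ.

```python
# Hedged sketch: SE-style channel attention applied to masked text/image fusion.
# All module names and the fusion rule below are illustrative assumptions.
import torch
import torch.nn as nn


class AdaptiveChannelAttention(nn.Module):
    """Re-weights channels of a fused feature map (squeeze-and-excitation style)."""

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # squeeze: global context per channel
        self.fc = nn.Sequential(                          # excitation: per-channel gates in (0, 1)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = fused.shape
        gates = self.fc(self.pool(fused).view(b, c)).view(b, c, 1, 1)
        return fused * gates                              # accentuate informative feature maps


# Toy usage: broadcast a sentence embedding over an image feature map, letting the
# text condition only the foreground while the background mask preserves the
# image-branch features where the caption says nothing.
if __name__ == "__main__":
    img_feat = torch.randn(2, 256, 16, 16)                # image-branch features
    txt_feat = torch.randn(2, 256, 1, 1)                  # sentence embedding, broadcast spatially
    bg_mask = (torch.rand(2, 1, 16, 16) > 0.5).float()    # 1 = background, 0 = foreground

    fused = img_feat * bg_mask + (img_feat + txt_feat) * (1.0 - bg_mask)
    out = AdaptiveChannelAttention(256)(fused)
    print(out.shape)                                      # torch.Size([2, 256, 16, 16])
```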