Abstract: Highlights
• A recurrent semantic fusion network to ensure a coherent fusion of text–visual cues.
• A contrastive loss to strengthen the underlying semantics of the text and the image.
• A dynamic convolution to enable the generator to dynamically produce an image.
• A word-level discriminator to capture the relationship between words and image subregions.
• Experimental results show the efficacy of SF-GAN on the CUB and COCO datasets.
External IDs:doi:10.1016/j.eswa.2024.125583