Abstract: Most image captioning methods are trained with full supervision from paired image–caption data. Because collecting such data is expensive, the task of unpaired image captioning has attracted increasing attention. In this article, we propose a novel memorial GAN (MemGAN) with joint semantic optimization for unpaired image captioning. The core idea is to explore the implicit semantic correlation between disjoint images and sentences by building a multimodal semantic-aware space (SAS). Concretely, each modality is mapped into a unified multimodal SAS, which contains the semantic vectors of the image $I$, visual concepts $O$, unpaired sentence $S$, and generated caption $C$. We adopt a memory unit based on multihead attention and a relational gate as the backbone to preserve and transmit crucial multimodal semantics within the SAS for image caption generation and sentence reconstruction. The memory unit is then embedded into a GAN framework to exploit semantic similarity and relevance in the SAS, that is, to impose a joint semantic-aware optimization on the SAS without supervision cues. In summary, the proposed MemGAN learns the latent semantic relevance among the SAS's modalities in an adversarial manner.
Extensive experiments and qualitative results demonstrate the effectiveness of MemGAN, which improves over state-of-the-art methods on unpaired image captioning benchmarks.
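The memory unit described above — multihead attention over memory slots followed by a relational gate that controls how much of the attended update is written back — can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names (`multihead_attention`, `memory_update`), the gate parameterization `Wg`, and all shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multihead_attention(query, context, n_heads):
    """Scaled dot-product attention, one slice of the feature dim per head.

    query: (q, d) memory slots attending over context: (m, d).
    Projection matrices are omitted for brevity (illustrative sketch).
    """
    d = query.shape[-1]
    dh = d // n_heads
    out = np.zeros_like(query)
    for h in range(n_heads):
        q = query[:, h * dh:(h + 1) * dh]
        k = context[:, h * dh:(h + 1) * dh]
        v = k  # keys double as values in this simplified sketch
        att = softmax(q @ k.T / np.sqrt(dh))
        out[:, h * dh:(h + 1) * dh] = att @ v
    return out

def memory_update(memory, inputs, Wg, n_heads=2):
    """One write step: attend to incoming semantics, then gate the write.

    memory: (slots, d)  current multimodal memory
    inputs: (tokens, d) incoming semantic vectors (e.g. from I, O, S, or C)
    Wg:     (2d, d)     hypothetical relational-gate weights
    """
    update = multihead_attention(memory, inputs, n_heads)
    # Relational gate: per-slot, per-dimension blend of old vs. new content.
    gate = sigmoid(np.concatenate([memory, update], axis=-1) @ Wg)
    return gate * update + (1.0 - gate) * memory
```

Because the gate output lies in (0, 1), each updated slot is a convex combination of its previous content and the attended update, which is what lets the unit preserve earlier multimodal semantics while admitting new ones.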