Improving Diversity of Image Captioning Through Variational Autoencoders and Adversarial Learning

WACV 2019 (modified: 06 Nov 2022)
Abstract: Learning to translate images into human-readable natural language has become a major challenge in computer vision research in recent years. Existing works explore the semantic correlation between the visual and language domains via encoder-to-decoder learning frameworks that classify visual features in the language domain. This approach, however, is criticized for its lack of naturalness and diversity. In this paper, we demonstrate a novel way to learn a semantic connection between visual information and natural language directly, based on a Variational Autoencoder (VAE) trained in an adversarial routine. Instead of using a classification-based discriminator, our method directly learns to estimate the diversity between a hidden vector embedded by a text encoder and an informative feature sampled from the learned distribution of the autoencoder. We show that the sentences learned from this matching contain accurate semantic meaning with high diversity in the image captioning task. Our experiments on the popular MSCOCO dataset indicate that our method learns to generate high-quality natural language with competitive scores on both correctness and diversity.
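The paper itself provides no code here; as a rough numeric sketch of the two quantities the abstract compares — a feature sampled from the VAE's learned distribution (via the standard reparameterization trick) and a hidden vector from a text encoder — one might write the following. All names, dimensions, and the cosine criterion are illustrative assumptions, not the authors' actual method:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps: the VAE reparameterization trick,
    giving a feature drawn from the learned Gaussian distribution."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

# Hypothetical stand-ins for learned encoder outputs (feature dim d=8):
d = 8
mu = rng.standard_normal(d)          # mean predicted by the image-side encoder
log_var = rng.standard_normal(d) * 0.1
z = reparameterize(mu, log_var, rng) # sampled informative feature
h = rng.standard_normal(d)           # hidden vector from a text encoder

def match_score(z, h):
    """Cosine similarity as one plausible matching criterion a
    discriminator could learn to score, rather than a class label."""
    return float(z @ h / (np.linalg.norm(z) * np.linalg.norm(h)))

print(match_score(z, h))  # scalar in [-1, 1]
```

In the paper's adversarial routine, the discriminator's score over such (z, h) pairs would be learned rather than a fixed cosine; the sketch only fixes the shapes of the two vectors being compared.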
