Augmented Visual-Semantic Embeddings for Image and Sentence Matching

ICIP 2019
Abstract: The task of image and sentence matching has witnessed significant progress recently, but it remains challenging due to the tremendous semantic gap between a pixel-level image and its matched sentences. With limited training data, it is difficult to optimize visual-semantic embeddings. In this work, we propose to augment visual-semantic embeddings by enlarging the training dataset. With more data, models can learn discriminative features with high-quality semantic concepts. More specifically, we augment the data by generating sentences for given images. Our method consists of two steps. First, to enlarge the training dataset, we perform image captioning on each image. Rather than introducing redundancy into the augmented dataset, we want the generated sentences to be stylistically diverse while maintaining fidelity. We therefore resort to generative adversarial networks (GANs), which can produce more flexible expressions than methods based on the maximum likelihood principle. Then, we train the visual-semantic embeddings on the augmented dataset and obtain a model for image and sentence matching. Experiments on a popular benchmark demonstrate the effectiveness of our method, which achieves superior results compared to our baseline.
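To make the two-step pipeline concrete, below is a minimal sketch of how GAN-generated captions could be folded into visual-semantic embedding training. It is not the authors' implementation: the `gan_captioner.sample` call stands in for a pretrained GAN-based captioner (name and interface assumed), the feature dimensions are placeholders, and the hardest-negative hinge triplet loss is one common choice for visual-semantic embeddings; the paper may use a different ranking objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VSE(nn.Module):
    """Joint visual-semantic embedding: projects image and sentence
    features into a shared space and L2-normalizes both."""
    def __init__(self, img_dim=2048, txt_dim=1024, embed_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.txt_proj = nn.Linear(txt_dim, embed_dim)

    def forward(self, img_feats, txt_feats):
        v = F.normalize(self.img_proj(img_feats), dim=-1)
        t = F.normalize(self.txt_proj(txt_feats), dim=-1)
        return v, t

def hinge_triplet_loss(v, t, margin=0.2):
    """Bidirectional hinge-based ranking loss over a batch,
    using the hardest in-batch negatives (one common VSE objective)."""
    scores = v @ t.t()                                    # cosine similarities
    pos = scores.diag().view(-1, 1)                       # matched pairs
    cost_s = (margin + scores - pos).clamp(min=0)         # image -> sentence
    cost_im = (margin + scores - pos.t()).clamp(min=0)    # sentence -> image
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_s = cost_s.masked_fill(mask, 0)
    cost_im = cost_im.masked_fill(mask, 0)
    return cost_s.max(1)[0].mean() + cost_im.max(0)[0].mean()

# --- hypothetical augmentation step ---------------------------------
# `gan_captioner` stands in for a pretrained GAN-based image captioner;
# its `sample(image, k)` method (assumed interface) returns k diverse
# synthetic sentences for the image.
def augment_dataset(dataset, gan_captioner, k=3):
    augmented = []
    for image, gt_sentences in dataset:
        generated = gan_captioner.sample(image, k)
        augmented.append((image, gt_sentences + generated))
    return augmented
```

Under these assumptions, training proceeds exactly as for the baseline embedding model, except that each image now contributes both its ground-truth captions and the GAN-generated ones as positive pairs.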