Variational Deep Representation Learning for Cross-Modal Retrieval

Published: 01 Jan 2021, Last Modified: 13 Nov 2023, PRCV (2) 2021
Abstract: In this paper, we propose a variational deep representation learning (VDRL) approach for cross-modal retrieval. Numerous existing methods map images and texts to point representations, which makes it challenging to model the semantic multiplicity of a sample. To address this issue, our VDRL maps images and texts to semantic distributions and measures similarity by comparing the difference between these distributions. Specifically, the VDRL network is trained under three constraints: 1) the Variational Autoencoder loss is minimized to learn the distributions of the images in the image semantic space and of the texts in the text semantic space; 2) a mutual information term is introduced to ensure that VDRL learns an intact distribution for each sample; 3) a triplet hinge loss is incorporated to align the distributions of images and texts at the semantic level. Consequently, the semantic multiplicity of each sample is modeled in our method. Experimental results demonstrate that our approach achieves performance competitive with state-of-the-art methods.
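The abstract describes measuring cross-modal similarity by comparing distributions rather than points, and aligning them with a triplet hinge loss. The sketch below illustrates this idea under stated assumptions: the paper's exact objective is not reproduced here, so we assume each sample is encoded as a diagonal Gaussian (mean and log-variance) and use the KL divergence as the distributional distance; the function names (`kl_diag_gauss`, `triplet_hinge`) and the margin value are hypothetical, not from the paper.

```python
import numpy as np

def kl_diag_gauss(mu1, logvar1, mu2, logvar2):
    # KL(N(mu1, var1) || N(mu2, var2)) for diagonal Gaussians,
    # summed over dimensions. This is one common choice of
    # distributional distance; the paper's exact measure may differ.
    var1, var2 = np.exp(logvar1), np.exp(logvar2)
    return 0.5 * np.sum(logvar2 - logvar1
                        + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def triplet_hinge(anchor, pos, neg, margin=0.2):
    # anchor/pos/neg are (mu, logvar) pairs, e.g. an image anchor
    # with a matching and a non-matching text. The hinge pushes the
    # matching pair's distance below the non-matching pair's by a margin.
    d_pos = kl_diag_gauss(*anchor, *pos)
    d_neg = kl_diag_gauss(*anchor, *neg)
    return max(0.0, margin + d_pos - d_neg)
```

With identical anchor and positive distributions, `d_pos` is zero, so the loss vanishes as soon as the negative is farther away than the margin; in training, this term would be combined with the VAE reconstruction loss and the mutual-information constraint mentioned above.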