Cross-Modal Joint Embedding with Diverse Semantics

Zhongwei Xie, Ling Liu, Yanzhao Wu, Lin Li, Luo Zhong

2020 (modified: 17 Nov 2022), CogMI 2020, Readers: Everyone

Abstract: Textual-visual cross-modal retrieval has been an active research area in both the computer vision and natural language processing communities. Most existing works learn a joint embedding model that maps raw text-image pairs onto a joint latent representation space, in which the similarity between textual embeddings and visual embeddings can be computed and compared, without leveraging diverse semantics. This paper presents a general framework to study and evaluate the impact of diverse semantics extracted from the multi-modal input data on the quality and performance of joint embedding learning. We identify different ways in which conventional features, such as TF-IDF term-frequency semantics and image category semantics, can be combined with neural features to further boost the efficiency of joint embedding learning. Experiments on the benchmark dataset Recipe1M demonstrate that existing representative cross-modal joint embedding approaches, when enhanced with diverse semantics in both the raw inputs and the joint embedding loss optimization, can effectively boost their cross-modal retrieval performance.
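As a rough illustration of what combining TF-IDF term-frequency semantics with neural features on the text side might look like, the sketch below computes smoothed TF-IDF vectors for a few toy ingredient lists and concatenates each with a placeholder "neural" vector. Everything here is an assumption made for illustration: the function names (`tfidf_vectors`, `fuse`), the toy data, the hand-written stand-in neural vectors, and concatenation as the fusion method. It does not reproduce the paper's framework, its learned joint embedding, or its loss.

```python
import math
from collections import Counter


def l2_normalise(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def tfidf_vectors(docs):
    """Return (vocab, vectors): one unit-length TF-IDF vector per tokenised doc."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: number of docs in which each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF so that no token gets a zero or undefined weight.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1.0 for tok in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        length = max(len(doc), 1)  # guard against empty documents
        vectors.append(l2_normalise([tf[tok] / length * idf[tok] for tok in vocab]))
    return vocab, vectors


def fuse(neural_vec, tfidf_vec):
    """Hypothetical fusion: concatenate both features, then renormalise."""
    return l2_normalise(list(neural_vec) + list(tfidf_vec))


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


# Toy ingredient lists standing in for recipe text (not taken from Recipe1M).
recipes = [
    ["flour", "sugar", "butter"],
    ["tomato", "basil", "pasta"],
    ["flour", "butter", "egg"],
]
# Placeholder 2-d "neural" features; a real system would use a trained encoder.
neural = [[0.9, 0.1], [0.1, 0.9], [0.8, 0.2]]

vocab, tfidf = tfidf_vectors(recipes)
fused = [fuse(n, t) for n, t in zip(neural, tfidf)]
```

In a full pipeline, a fused text feature of this kind would be passed to a trainable text encoder whose output lives in the same space as the image embedding. Concatenation is only one of several plausible ways to combine the two signals.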
