Abstract: Interactive image retrieval is an emerging research topic with
the objective of integrating inputs from multiple modalities as a query for retrieval, e.g., textual feedback from users to guide, modify, or refine image retrieval. In this work, we study the problem of composing images
and textual modifications for language-guided retrieval in the context of
fashion applications. We propose a unified Joint Visual Semantic Matching (JVSM) model that learns image-text compositional embeddings by jointly associating visual and textual modalities in a shared discriminative embedding space via compositional losses. JVSM has been designed
with versatility and flexibility in mind, being able to perform multiple
image and text tasks in a single model, such as text-image matching and
language-guided retrieval. We show the effectiveness of our approach in
the fashion domain, where the specificity and complexity of fashion terminology make keyword-based queries difficult to express. Our experiments on three
datasets (Fashion-200k, UT-Zap50k, and Fashion-iq) show that JVSM
achieves state-of-the-art results on language-guided retrieval; additionally, we show its ability to perform image and text retrieval.
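
To make the idea of composing image and text embeddings in a shared space concrete, below is a minimal illustrative sketch in PyTorch. It is not the paper's actual JVSM architecture or loss; the projection sizes, the concatenation-based composition module, and the triplet-style matching loss are assumptions chosen only to show the general pattern of language-guided retrieval: fuse a reference-image embedding with a text-modification embedding, then score candidate images against the composed query.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch (hypothetical, not the paper's exact model): compose an
# image embedding with a text-modification embedding and compare the result
# to candidate image embeddings in a shared, L2-normalized space.

class ComposedRetrieval(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, emb_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)   # visual encoder head
        self.txt_proj = nn.Linear(txt_dim, emb_dim)   # text encoder head
        self.compose = nn.Sequential(                 # fuse image + modifier text
            nn.Linear(2 * emb_dim, emb_dim),
            nn.ReLU(),
            nn.Linear(emb_dim, emb_dim),
        )

    def embed_image(self, img_feat):
        return F.normalize(self.img_proj(img_feat), dim=-1)

    def embed_text(self, txt_feat):
        return F.normalize(self.txt_proj(txt_feat), dim=-1)

    def embed_query(self, img_feat, txt_feat):
        # Composed query = fusion of the reference image and the textual modification.
        q = torch.cat([self.embed_image(img_feat), self.embed_text(txt_feat)], dim=-1)
        return F.normalize(self.compose(q), dim=-1)


def matching_loss(query, pos, neg, margin=0.2):
    """Triplet-style hinge loss: pull the composed query toward the target
    image embedding and push it away from a non-matching one."""
    pos_sim = (query * pos).sum(dim=-1)   # cosine similarity (embeddings are normalized)
    neg_sim = (query * neg).sum(dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()
```

At retrieval time, the same shared space would also support plain text-image matching by scoring `embed_text` outputs against `embed_image` outputs, which is the sense in which a single joint embedding model can serve multiple image and text tasks.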