Abstract: Interactive image retrieval is an emerging research topic with
the objective of integrating inputs from multiple modalities as a query for retrieval, e.g., textual feedback from users to guide, modify, or refine image retrieval. In this work, we study the problem of composing images
and textual modifications for language-guided retrieval in the context of
fashion applications. We propose a unified Joint Visual Semantic Matching (JVSM) model that learns image-text compositional embeddings by jointly associating visual and textual modalities in a shared discriminative embedding space via compositional losses. JVSM has been designed
with versatility and flexibility in mind, being able to perform multiple
image and text tasks in a single model, such as text-image matching and
language-guided retrieval. We show the effectiveness of our approach in
the fashion domain, where the specificity and complexity of fashion terminology make keyword-based queries difficult to express. Our experiments on three
datasets (Fashion-200k, UT-Zap50k, and Fashion-iq) show that JVSM
achieves state-of-the-art results on language-guided retrieval; additionally, we show its ability to perform image and text retrieval.
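
To make the idea of composing image and text embeddings in a shared space concrete, below is a minimal illustrative sketch in PyTorch. It is not the paper's actual JVSM architecture or loss; the projection sizes, the concatenation-based composition module, and the triplet-style matching loss are assumptions chosen only to show the general pattern of language-guided retrieval: fuse a reference-image embedding with a text-modification embedding, then score candidate images against the composed query.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch (hypothetical, not the paper's exact model): compose an
# image embedding with a text-modification embedding and compare the result
# to candidate image embeddings in a shared, L2-normalized space.

class ComposedRetrieval(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=300, emb_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)   # visual encoder head
        self.txt_proj = nn.Linear(txt_dim, emb_dim)   # text encoder head
        self.compose = nn.Sequential(                 # fuse image + modifier text
            nn.Linear(2 * emb_dim, emb_dim),
            nn.ReLU(),
            nn.Linear(emb_dim, emb_dim),
        )

    def embed_image(self, img_feat):
        return F.normalize(self.img_proj(img_feat), dim=-1)

    def embed_text(self, txt_feat):
        return F.normalize(self.txt_proj(txt_feat), dim=-1)

    def embed_query(self, img_feat, txt_feat):
        # Composed query = fusion of the reference image and the textual modification.
        q = torch.cat([self.embed_image(img_feat), self.embed_text(txt_feat)], dim=-1)
        return F.normalize(self.compose(q), dim=-1)


def matching_loss(query, pos, neg, margin=0.2):
    """Triplet-style hinge loss: pull the composed query toward the target
    image embedding and push it away from a non-matching one."""
    pos_sim = (query * pos).sum(dim=-1)   # cosine similarity (embeddings are normalized)
    neg_sim = (query * neg).sum(dim=-1)
    return F.relu(margin - pos_sim + neg_sim).mean()
```

At retrieval time, the same shared space would also support plain text-image matching by scoring `embed_text` outputs against `embed_image` outputs, which is the sense in which a single joint embedding model can serve multiple image and text tasks.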