Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval

Published: 01 Feb 2023, Last Modified: 13 Feb 2023 · ICLR 2023 poster
Keywords: Multi-Modal Retrieval, Dense Retrieval, Universal Embedding Space, Modality-Balanced Hard Negative Training, Image Verbalization
Abstract: This paper presents Universal Vision-Language Dense Retrieval (UniVL-DR), which builds a unified model for multi-modal retrieval. UniVL-DR encodes queries and multi-modality resources in a shared embedding space and searches for candidates across different modalities. To learn this unified embedding space, UniVL-DR introduces two techniques: 1) a universal embedding optimization strategy, which contrastively optimizes the embedding space using modality-balanced hard negatives; 2) an image verbalization method, which bridges the modality gap between images and texts in the raw data space. UniVL-DR achieves state-of-the-art results on the multi-modal open-domain question answering benchmark WebQA and outperforms all retrieval models on both subtasks, text-text retrieval and text-image retrieval. These results demonstrate that a single unified model can feasibly replace the divide-and-conquer multi-modal search pipeline while also benefiting single- and cross-modality retrieval tasks. All source code for this work is available at https://github.com/OpenMatch/UniVL-DR.
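Of the two techniques named in the abstract, the first is concrete enough to sketch from the description alone. Below is a minimal PyTorch illustration (not the authors' released implementation; the function name univl_dr_loss, the temperature tau, and the tensor layout are all assumptions) of a contrastive loss in which each query is paired with one hard negative from each modality, so text and image candidates contribute balanced gradient signal. The second technique, image verbalization, depends on an external captioning/verbalization pipeline and is not sketched here.

```python
# Minimal sketch, assuming B queries per batch, one gold candidate per
# query, and one pre-mined hard negative per modality per query.
# This is an illustration of modality-balanced hard-negative contrastive
# training, not the authors' code.
import torch
import torch.nn.functional as F

def univl_dr_loss(q, pos, neg_text, neg_image, tau=0.01):
    """q: [B, d] query embeddings.
    pos: [B, d] embeddings of each query's gold candidate (text or image).
    neg_text, neg_image: [B, d] hard negatives, one per modality per
    query, so neither modality dominates the negative pool.
    """
    q = F.normalize(q, dim=-1)
    # Pool positives and both modalities' hard negatives; every query is
    # also scored against the other queries' candidates (in-batch negatives).
    cands = F.normalize(torch.cat([pos, neg_text, neg_image], dim=0), dim=-1)
    logits = q @ cands.t() / tau          # [B, 3B] similarity scores
    # Query i's gold candidate sits at column i of the logits matrix.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```

The design intent, as the abstract describes it, is that drawing hard negatives evenly from both modalities keeps the contrastive objective from collapsing toward the easier modality, which is what pushes texts and images into one genuinely shared space.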
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Applications (e.g., speech processing, computer vision, NLP)
TL;DR: This paper presents Universal Vision-Language Dense Retrieval (UniVL-DR), which builds a unified model for multi-modal retrieval, learns universal representations for images and texts, and achieves state-of-the-art results.
Supplementary Material: zip