Cross-Modal Retrieval Augmentation for Multi-Modal Classification

28 Sept 2020 (modified: 05 May 2023)
ICLR 2021 Conference Blind Submission
Keywords: Multi-Modal, VQA, Retrieval
Abstract: Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions to improve visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, achieving state-of-the-art image-caption retrieval performance relative to comparable methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model significantly improve VQA results over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel inference-time applications such as hot-swapping retrieval indices.
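The two components named in the abstract lend themselves to short illustrations. Below is a minimal sketch of a symmetric contrastive (InfoNCE-style) objective for embedding images and captions in a shared space; the encoders, loss form, and temperature value are assumptions for illustration, not the submission's actual alignment model.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb: torch.Tensor,
                               caption_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/caption embeddings.

    image_emb, caption_emb: (batch, dim) outputs of the two encoders.
    NOTE: an illustrative stand-in, not the paper's actual objective.
    """
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    caption_emb = F.normalize(caption_emb, dim=-1)

    # (batch, batch) similarity matrix; the diagonal holds the true pairs.
    logits = image_emb @ caption_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Contrast in both directions: image -> caption and caption -> image.
    loss_i2c = F.cross_entropy(logits, targets)
    loss_c2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2c + loss_c2i) / 2
```

"Hot-swapping indices" at inference time can likewise be sketched: the query encoder stays fixed while the retrieval index (and its caption store) is replaced by one built over a different knowledge source. The `SwappableRetriever` class and the FAISS inner-product index below are hypothetical choices, not the paper's implementation.

```python
import faiss  # pip install faiss-cpu
import numpy as np

class SwappableRetriever:
    """Nearest-neighbor retriever whose index can be replaced at inference time."""

    def __init__(self, embeddings: np.ndarray, captions: list):
        self._build(embeddings, captions)

    def _build(self, embeddings: np.ndarray, captions: list) -> None:
        self.captions = captions
        # Inner-product index over L2-normalized embeddings = cosine search.
        self.index = faiss.IndexFlatIP(embeddings.shape[1])
        self.index.add(np.ascontiguousarray(embeddings, dtype=np.float32))

    def swap_index(self, embeddings: np.ndarray, captions: list) -> None:
        """Hot-swap the knowledge source without touching the query encoder."""
        self._build(embeddings, captions)

    def retrieve(self, query_emb: np.ndarray, k: int = 5) -> list:
        query = np.ascontiguousarray(query_emb, dtype=np.float32).reshape(1, -1)
        _, ids = self.index.search(query, k)
        return [self.captions[i] for i in ids[0]]
```

Swapping the index changes what external knowledge the downstream model can condition on without retraining the query encoder, which is the property the abstract highlights.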
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=P0CmuT0-l7