UNITER: Learning UNiversal Image-TExt Representations

Yen-Chun Chen; Linjie Li; Licheng Yu; Ahmed El Kholy; Faisal Ahmed; Zhe Gan; Yu Cheng; Jingjing Liu

25 Sept 2019 (modified: 22 Jun 2025); ICLR 2020 Conference Blind Submission; Readers: Everyone

Keywords: Self-supervised Representation Learning, Large-scale Pre-training, Vision and Language

TL;DR: We introduce UNITER, a UNiversal Image-TExt Representation learned through large-scale pre-training over image-text datasets, which achieves state-of-the-art results across six Vision-and-Language tasks over nine datasets.

Abstract: Joint image-text embedding is the bedrock for most Vision-and-Language (V+L) tasks, where multimodal inputs are jointly processed for visual and textual understanding. In this paper, we introduce UNITER, a UNiversal Image-TExt Representation, learned through large-scale pre-training over four image-text datasets (COCO, Visual Genome, Conceptual Captions, and SBU Captions), which can power heterogeneous downstream V+L tasks with joint multimodal embeddings. We design three pre-training tasks: Masked Language Modeling (MLM), Image-Text Matching (ITM), and Masked Region Modeling (MRM, with three variants). Unlike concurrent work on multimodal pre-training that applies joint random masking to both modalities, we use Conditioned Masking on pre-training tasks (i.e., masked language/region modeling is conditioned on full observation of the image/text). Comprehensive analysis shows that conditioned masking yields better performance than unconditioned masking. We also conduct a thorough ablation study to find an optimal combination of pre-training tasks for UNITER. Extensive experiments show that UNITER achieves a new state of the art across six V+L tasks over nine datasets, including Visual Question Answering, Image-Text Retrieval, Referring Expression Comprehension, Visual Commonsense Reasoning, Visual Entailment, and NLVR2.
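
The difference between conditioned masking and joint random masking described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the 0.15 masking probability, the coin flip used to pick which modality to mask, and the use of plain strings in place of word tokens and region features are all assumptions made for illustration.

```python
import random

MASK = "[MASK]"  # placeholder standing in for a masked token or region


def mask_sequence(items, prob, rng):
    """Replace each item with MASK independently with probability `prob`."""
    return [MASK if rng.random() < prob else item for item in items]


def conditioned_masking(tokens, regions, prob=0.15, rng=None):
    """Mask only one modality; the other stays fully observed.

    Assumption: the modality to mask is chosen by a fair coin flip.
    """
    if rng is None:
        rng = random.Random()
    if rng.random() < 0.5:
        # Masked language modeling conditioned on the full image.
        return mask_sequence(tokens, prob, rng), list(regions)
    # Masked region modeling conditioned on the full text.
    return list(tokens), mask_sequence(regions, prob, rng)


def joint_random_masking(tokens, regions, prob=0.15, rng=None):
    """Mask both modalities independently (the unconditioned baseline)."""
    if rng is None:
        rng = random.Random()
    return mask_sequence(tokens, prob, rng), mask_sequence(regions, prob, rng)


if __name__ == "__main__":
    tokens = ["a", "dog", "on", "the", "grass"]
    regions = ["region_0", "region_1", "region_2"]
    print(conditioned_masking(tokens, regions, rng=random.Random(0)))
    print(joint_random_masking(tokens, regions, rng=random.Random(0)))
```

Under conditioned masking, a masked word can always be recovered with the whole image available, and a masked region with the whole sentence available. Under joint random masking, a word and the region it describes may be hidden at the same time.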
