GLIPv2: Unifying Localization and Vision-Language Understanding

Haotian Zhang; Pengchuan Zhang; Xiaowei Hu; Yen-Chun Chen; Liunian Harold Li; Xiyang Dai; Lijuan Wang; Lu Yuan; Jenq-Neng Hwang; Jianfeng Gao

GLIPv2: Unifying Localization and Vision-Language Understanding

Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, Jianfeng Gao

Published: 31 Oct 2022, Last Modified: 04 Aug 2025NeurIPS 2022 AcceptReaders: Everyone

Keywords: region-aware, vision-language, open-vocabulary object detection and segmentation, phrase grounding, VQA, image captioning

TL;DR: We present a region-aware vision-language pre-trained model that serves both localization tasks (e.g., object detection, instance segmentation) and understanding (e.g., VQA, image captioning) tasks

Abstract: We present GLIPv2, a grounded VL understanding model, that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word level contrastive learning task, and the masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaption performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks.

Supplementary Material: pdf

Community Implementations: [![CatalyzeX](/images/catalyzex_icon.svg) 1 code implementation](https://www.catalyzex.com/paper/glipv2-unifying-localization-and-vision/code)

13 Replies

Loading