Abstract: Visual grounding focuses on localizing objects referred to by natural language queries. Existing fully and weakly supervised methods rely on large numbers of language queries for training. However, collecting natural language queries that correspond to specific objects from human annotators is expensive. To reduce the reliance on human-written queries, we propose a novel unsupervised visual grounding framework named VG-Annotator. Different from existing unsupervised methods, which rely on manually designed rules to link objects and language queries, the key idea of VG-Annotator is that vision-language pre-trained (VLP) generation models can serve as language query annotators. Thanks to the powerful multi-modal understanding ability implicitly learned from large-scale pre-training, these models can be stimulated to explicitly generate appropriate descriptions of specific objects in natural language. To this end, we explore a series of multi-modal instructions that indicate which object should be described. We also introduce a supervised fine-tuning process that teaches the vision-language models to follow these instructions. Extensive experiments show that the proposed method produces high-quality language queries, and the visual grounding model trained with the generated queries outperforms state-of-the-art unsupervised methods on five widely used datasets.
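To make the idea of using a VLP generation model as a query annotator concrete, the sketch below shows one plausible form of a multi-modal instruction: the target region is marked directly in the image and a text prompt asks the model to describe the marked object. The model choice (an off-the-shelf BLIP-2 checkpoint), the red-box marker, the prompt wording, and the helper `describe_region` are illustrative assumptions, not the exact instructions or model described in the paper.

```python
# Illustrative sketch only: one possible way to prompt a VLP generation model
# to describe a specific object. The red-box marker and the prompt text are
# assumptions for illustration, not VG-Annotator's actual instruction design.
import torch
from PIL import Image, ImageDraw
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def describe_region(image_path: str, box: tuple[int, int, int, int]) -> str:
    """Generate a natural-language query for the object inside `box` (x1, y1, x2, y2)."""
    image = Image.open(image_path).convert("RGB")
    # Visual part of the multi-modal instruction: mark the target object in the pixels.
    marked = image.copy()
    ImageDraw.Draw(marked).rectangle(box, outline="red", width=4)
    # Textual part of the instruction: indicate which object should be described.
    prompt = "Question: Describe the object inside the red box. Answer:"
    inputs = processor(images=marked, text=prompt, return_tensors="pt").to("cuda", torch.float16)
    out = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()

# Example usage (paths and coordinates are placeholders): the generated query
# can then serve as a pseudo-label for training a visual grounding model.
# query = describe_region("example.jpg", (40, 60, 220, 300))
```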