QueryMatch: A Query-based Contrastive Learning Framework for Weakly Supervised Visual Grounding

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, Venue: MM2024 Poster, License: CC BY 4.0
Abstract: Visual grounding is the task of locating the object referred to by a natural language description. To reduce annotation costs, recent research has turned to one-stage weakly supervised methods for visual grounding, which typically adopt the anchor-text matching paradigm. Despite their efficiency, we find that anchor representations are often noisy and insufficient to describe object information, which inevitably hinders vision-language alignment. In this paper, we propose a novel query-based one-stage framework for weakly supervised visual grounding, namely QueryMatch. Unlike previous work, QueryMatch represents candidate objects with a set of query features, which inherently establish accurate one-to-one associations with visual objects. QueryMatch thus reformulates weakly supervised visual grounding as a query-text matching problem, which can be optimized via query-based contrastive learning. Based on QueryMatch, we further propose an innovative strategy for effective weakly supervised learning, namely Negative Sample Quality Estimation (NSQE). In particular, NSQE aims to augment negative training samples by actively selecting high-quality query features. Through this strategy, NSQE can greatly benefit the weakly supervised learning of QueryMatch. To validate our approach, we conduct extensive experiments on three benchmark datasets covering two grounding tasks, i.e., referring expression comprehension (REC) and referring expression segmentation (RES). Experimental results not only show the state-of-the-art performance of QueryMatch on both tasks, e.g., gains of over 5% IoU@0.5 on RefCOCO in REC and over 20% mIoU on RefCOCO in RES, but also confirm the effectiveness of NSQE in weakly supervised learning. Source code is available at https://anonymous.4open.science/r/QueryMatch-A82C.
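The query-text matching idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the loss, the positive-selection rule, or the NSQE criterion. The sketch assumes an InfoNCE-style loss, takes the best-matching query of the paired image as a pseudo-positive, and stands in for NSQE with a hypothetical per-query `quality` score from which the `top_k` queries of each other image are kept as negatives. The names `quality`, `top_k`, and `tau` are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def query_text_contrastive_loss(queries, quality, texts, top_k=2, tau=0.1):
    """InfoNCE-style query-text loss (illustrative sketch only).

    queries[i]: list of query feature vectors for image i
    quality[i]: hypothetical per-query quality scores for image i
    texts[i]:   text feature vector paired with image i
    """
    losses = []
    for i, text in enumerate(texts):
        # Assumption: with no box labels, the best-matching query of the
        # paired image serves as the pseudo-positive.
        pos = max(cosine(q, text) for q in queries[i])

        # Assumption: negatives are the top_k highest-quality queries
        # taken from every other image in the batch.
        negs = []
        for j, (qs, scores) in enumerate(zip(queries, quality)):
            if j == i:
                continue
            ranked = sorted(range(len(qs)), key=lambda k: scores[k], reverse=True)
            negs.extend(cosine(qs[k], text) for k in ranked[:top_k])

        # Numerically stable -log softmax of the positive logit.
        logits = [pos / tau] + [n / tau for n in negs]
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denom - logits[0])
    return sum(losses) / len(losses)
```

With toy features in which each text aligns with a query of its own image, the loss is close to zero; swapping the texts between images makes the selected negatives score higher than the pseudo-positive and the loss grows accordingly.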
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: We identify the shortcomings of object representations in existing one-stage weakly supervised visual grounding frameworks and propose an innovative strategy for effective weakly supervised learning, namely Negative Sample Quality Estimation (NSQE), which augments negative samples by selecting high-quality query features.
Submission Number: 2350