WGREC: Weakly Supervised Generalized Referring Expression Comprehension Empowered by Large Language Model
Keywords: Weakly Supervised Referring Expression Comprehension, Weakly Supervised Generalized Referring Expression Comprehension, Graph-based Knowledge Distillation Network, Large Language Model
Abstract: Weakly Supervised Referring Expression Comprehension (WREC) aims to locate the target object described by a given expression using weak supervision signals, such as image-text pairs.
Existing WREC methods typically assume that every expression corresponds to exactly one object in the image or in each frame of a video, ignoring scenarios where multiple objects or no objects match the expression.
Additionally, current WREC methods primarily rely on contrastive learning, using numerous positive and negative pairs to construct the loss. This approach has drawbacks: it incurs high computational and memory costs, reduces training efficiency, and is highly sensitive to pair selection, which can lead to unstable convergence or overfitting to specific pairs.
In this paper, we introduce a new task, Weakly Supervised Generalized Referring Expression Comprehension (WGREC), which extends traditional WREC to handle more realistic and complex scenarios.
To address this task, we design a novel graph-based knowledge distillation network (GKDN) guided by a large language model (LLM).
By using the LLM, we obtain two types of information: (1) descriptions of object candidates and their relationships, and (2) pseudo-target positions for the single or multiple objects mentioned in the expression. This information helps our network build attention graphs that model the links between objects and the expression while filtering out irrelevant candidates.
Finally, we design a concise objective function that leverages predictions, expressions, and pseudo-target positions to distill the capabilities of the LLM into our network. Extensive experiments on the gRefCOCO, RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate that our method achieves state-of-the-art (SoTA) performance, highlighting the effectiveness of our approach and its potential to advance the field of WGREC.
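The distillation idea sketched in the abstract (supervising candidate scores with LLM-provided pseudo-target positions, including the no-target case) can be illustrated with a minimal, hypothetical loss. The function name, the extra "no-target" slot, and the index convention below are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def pseudo_target_distillation_loss(candidate_logits, pseudo_target_idx):
    """Hypothetical sketch of a pseudo-target distillation objective.

    candidate_logits: scores over N object candidates plus one extra
        "no-target" slot at index N, so expressions that match no
        object can still be supervised (the generalized setting).
    pseudo_target_idx: indices of the candidates the LLM marked as
        referents (possibly several, or only the no-target slot).
    """
    probs = softmax(candidate_logits)
    # sum probability mass over all pseudo-target slots, then take
    # the negative log-likelihood of that mass
    p = probs[pseudo_target_idx].sum()
    return -np.log(p + 1e-9)

# Illustration: 4 candidates + 1 no-target slot; the LLM flags
# candidates 0 and 2 as referents of the expression.
logits = np.array([2.0, 0.1, 1.5, 0.0, -1.0])
loss = pseudo_target_distillation_loss(logits, [0, 2])
```

Raising the scores of the pseudo-target candidates lowers this loss, which is the direction a gradient-based trainer would push the network; a single multi-target likelihood like this also avoids enumerating positive/negative pairs, matching the abstract's motivation for moving away from contrastive objectives.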
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 12618