Weakly Supervised Referring Expression Grounding via Target-Guided Knowledge Distillation

Published: 01 Jan 2023 · Last Modified: 30 Sept 2024 · ICRA 2023 · CC BY-SA 4.0
Abstract: Weakly supervised referring expression grounding aims to train a model without manual annotations of the correspondence between image regions and referring expressions. Predominant models typically adopt deep structures to reconstruct the region-expression correspondence. A crucial deficiency of existing approaches is that they neglect potentially valuable information that could further improve grounding performance. To address this issue, we leverage knowledge distillation as a unique scheme to excavate and transfer helpful information for acquiring a better model. Specifically, we propose a target-guided knowledge distillation framework that accounts for both the reconstruction and the matching of region-expression pairs. We reactivate the target-related prediction information learned by a pre-trained teacher model and transfer this knowledge from the teacher to guide the training process and boost the performance of the student model. We conduct extensive experiments on three benchmark datasets, i.e., RefCOCO, RefCOCO+, and RefCOCOg. Without bells and whistles, our approach achieves state-of-the-art results on several splits of these benchmarks. The implementation code and trained models are available at: https://github.com/dami23/WREG_KD.
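To make the teacher-to-student transfer concrete, the following is a minimal PyTorch-style sketch of a generic distillation objective over per-region matching scores. It is not the authors' exact formulation: the function name, the `temperature` hyperparameter, and the KL-divergence form are all illustrative assumptions; the paper's actual loss may differ.

```python
import torch
import torch.nn.functional as F

def target_guided_distillation_loss(
    student_logits: torch.Tensor,   # (batch, num_regions) student matching scores
    teacher_logits: torch.Tensor,   # (batch, num_regions) scores from a pre-trained teacher
    temperature: float = 2.0,       # assumed softening temperature, not from the paper
) -> torch.Tensor:
    """Distill the teacher's target-related region predictions into the student."""
    # Soften both score distributions so the student also learns from the
    # teacher's relative preferences over non-target regions.
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Standard Hinton-style KL distillation; the T^2 factor keeps gradient
    # magnitudes comparable across temperature settings.
    return F.kl_div(student_log_probs, teacher_probs,
                    reduction="batchmean") * temperature ** 2
```

In practice such a term would be added, with some weight, to the model's weakly supervised reconstruction loss rather than used on its own.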