Part-Aware Prompt Tuning for Weakly Supervised Referring Expression Grounding

Chenlin Zhao, Jiabo Ye, Yaguang Song, Ming Yan, Xiaoshan Yang, Changsheng Xu

Published: 2024, Last Modified: 29 Feb 2024, MMM (3) 2024, Readers: Everyone

Abstract: Referring expression grounding is a complex multimodal task. To reduce conventional methods’ reliance on fine-grained supervised data, visual grounding needs to be explored under the weakly supervised setting, in which only image-text pairs are available. Weakly supervised methods built on pretrained multimodal models have achieved impressive results; however, during inference they fail to generate a comprehensive attention map for entities, which reduces inference accuracy. In this study, we introduce Part-aware Prompt Tuning (PPT), a weakly supervised method. During training, the entities extracted by the detector are divided into different parts, which are used to optimize part-aware prompts. During inference, these prompts guide the attention of the pretrained multimodal model toward a more comprehensive focus on the whole entity, thereby enhancing inference accuracy. Empirical validation on two benchmark datasets, RefCOCO and RefCOCO+, shows that the proposed method outperforms prior referring expression grounding methods.
