Triple Alignment Strategies for Zero-shot Phrase Grounding under Weak Supervision

Pengyue Lin; Ruifan Li; Yuzhe Ji; Zhihan Yu; Fangxiang Feng; Zhanyu Ma; Xiaojie Wang

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 | MM2024 Poster | CC BY 4.0

Abstract: Phrase grounding (PG) aims to locate the objects referred to by noun phrases. Recently, PG under weak supervision (i.e., grounding without region-level annotations) and zero-shot PG (i.e., grounding from seen categories to unseen ones) have each been proposed. However, in real-world applications both approaches are limited by the scarce annotations and the limited number of categories available during training. In this paper, we propose a framework for zero-shot PG under weak supervision. Specifically, our PG framework is built on three alignment strategies. First, we propose a region-text alignment (RTA) strategy that builds region-level attribute associations via CLIP. Second, we propose a domain alignment (DomA) strategy that minimizes the difference between the distributions of seen classes in training and in pre-training. Third, we propose a category alignment (CatA) strategy that considers both category semantics and region-category relations. Extensive experimental results show that our proposed PG framework outperforms previous zero-shot methods and achieves competitive performance compared with existing weakly-supervised methods. The code and data will be made publicly available on GitHub after the double-blind phase.

Primary Subject Area: [Content] Vision and Language

Relevance To Conference: This work focuses on the task of phrase grounding, which fits the vision and language area of multimodal information processing.

Supplementary Material: zip

Submission Number: 825
