Triple Alignment Strategies for Zero-shot Phrase Grounding under Weak Supervision

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, Venue: MM2024 Poster, License: CC BY 4.0
Abstract: Phrase grounding (PG) aims to locate the objects referred to by noun phrases. Recently, PG under weak supervision (i.e., grounding without region-level annotations) and zero-shot PG (i.e., grounding from seen categories to unseen ones) have each been proposed. However, in real-world applications both approaches are limited by the scarce annotations and the limited set of categories available during training. In this paper, we propose a framework for zero-shot PG under weak supervision. Specifically, our PG framework is built on three alignment strategies. First, we propose a region-text alignment (RTA) strategy to build region-level attribute associations via CLIP. Second, we propose a domain alignment (DomA) strategy that minimizes the difference between the distributions of seen classes in training and those in pre-training. Third, we propose a category alignment (CatA) strategy that considers both category semantics and region-category relations. Extensive experimental results show that our proposed PG framework outperforms previous zero-shot methods and achieves competitive performance compared with existing weakly-supervised methods. The code and data will be made publicly available on GitHub after the double-blind phase.
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: This work focuses on the task of phrase grounding, which fits the vision and language area of multimodal information processing.
Supplementary Material: zip
Submission Number: 825