Abstract: Phrase grounding (PG) aims to locate the objects referred to by noun phrases. Recently, PG under weak supervision (i.e., grounding without region-level annotations) and zero-shot PG (i.e., transferring grounding from seen categories to unseen ones) have each been proposed. However, in real-world applications both approaches are limited by the scarcity of annotations and the restricted set of categories available during training. In this paper, we propose a framework for zero-shot PG under weak supervision. Specifically, our PG framework is built on three alignment strategies. First, we propose a region-text alignment (RTA) strategy that builds region-level attribute associations via CLIP. Second, we propose a domain alignment (DomA) strategy that minimizes the difference between the distribution of seen classes in training and that in pre-training. Third, we propose a category alignment (CatA) strategy that considers both category semantics and region-category relations. Extensive experimental results show that our proposed PG framework outperforms previous zero-shot methods and achieves competitive performance compared with existing weakly-supervised methods. The code and data will be made publicly available on GitHub after the double-blind review phase.
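As a rough illustration of the general idea behind region-text matching, the sketch below scores candidate regions against a phrase by cosine similarity and picks the best one. It is not the authors' RTA strategy: the function names `cosine` and `ground_phrase` are hypothetical, and the toy vectors stand in for embeddings that a model such as CLIP would be assumed to produce.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ground_phrase(region_embeddings, phrase_embedding):
    """Return the index of the best-matching region and all similarity scores.

    Assumption: region and phrase embeddings live in a shared space
    (as image and text features from a CLIP-like model would).
    """
    scores = [cosine(r, phrase_embedding) for r in region_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Toy embeddings (made up for illustration, not real model outputs).
regions = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
phrase = [0.0, 2.0]
best_index, scores = ground_phrase(regions, phrase)
```

In this toy setup the phrase vector points in the same direction as the second region, so that region receives the highest score. No region-level labels are used at matching time, which is the property a weakly supervised setting relies on.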
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: This work focuses on the task of phrase grounding, which fits the vision and language area of multimodal information processing.
Supplementary Material: zip
Submission Number: 825