Visible to everyone since 04 Oct 2024. License: CC BY 4.0
Weakly supervised phrase grounding (WSPG) aims to localize the objects referred to by phrases without region-level annotations. State-of-the-art methods use vision-language pre-trained (VLP) models to build pseudo labels. However, the low quality of these pseudo labels can make the subsequent learning ineffective. In this paper, we propose a novel WSPG framework, Dual-cycle Consistency Learning (DCL). First, we propose a vision-modal cycle consistency to localize the referred objects and reconstruct the pseudo labels. To provide conditional guidance, we propose a visual prompt engineering scheme that generates marks for the input images. To further avoid random localization, we design a confidence-based regularization that filters out redundant information at the image and pixel levels. Second, we propose a language-modal cycle consistency to correctly recognize the referred objects. To correct their positions, we provide phrase-related boxes as supervision for further learning. Extensive experiments on benchmark datasets show the effectiveness of DCL, as well as its compatibility with various VLP models. The source code will be made available on GitHub after the double-blind review phase.
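To make the idea of filtering at the image and pixel levels concrete, the following is a minimal illustrative sketch, not the DCL implementation. It assumes a pseudo label is given as a 2D confidence heatmap. The function name `filter_pseudo_label`, the thresholds `image_thresh` and `pixel_ratio`, and the rule for deriving a box from the surviving pixels are all hypothetical choices made for illustration only.

```python
from typing import List, Optional, Tuple

Heatmap = List[List[float]]
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def filter_pseudo_label(
    heatmap: Heatmap,
    image_thresh: float = 0.5,   # hypothetical image-level threshold
    pixel_ratio: float = 0.6,    # hypothetical pixel-level ratio of the peak
) -> Optional[Tuple[List[List[int]], Box]]:
    """Filter a pseudo-label heatmap by confidence (illustrative only).

    Returns None if the whole sample is rejected; otherwise returns a
    binary mask of confident pixels and the tight box around them.
    """
    peak = max(max(row) for row in heatmap)

    # Image level: discard the sample if its peak confidence is too low.
    if peak < image_thresh:
        return None

    # Pixel level: keep only pixels close enough to the peak confidence.
    cutoff = pixel_ratio * peak
    mask = [[1 if value >= cutoff else 0 for value in row] for row in heatmap]

    # Derive a tight bounding box around the surviving pixels.
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    box = (min(xs), min(ys), max(xs), max(ys))
    return mask, box


if __name__ == "__main__":
    confident = [
        [0.1, 0.2, 0.1],
        [0.1, 0.9, 0.8],
        [0.0, 0.7, 0.1],
    ]
    print(filter_pseudo_label(confident))
    print(filter_pseudo_label([[0.1, 0.2], [0.3, 0.1]]))
```

In this sketch a sample whose peak confidence falls below `image_thresh` is dropped entirely, while a retained sample keeps only its high-confidence pixels, from which a box is derived. How DCL actually defines confidence and applies the regularization is not specified in the abstract.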