Abstract: Video-text retrieval aims to precisely search a video corpus for the videos most relevant to a text query. However, existing methods are largely limited to single-text (single-event) queries and handle multi-text (multi-event) queries poorly. Furthermore, these methods typically focus solely on retrieval and do not attempt to locate the multiple events within the retrieved videos. To address these limitations, this paper proposes a novel method named Disentangling Inter- and Intra-Video Relations, which jointly addresses multi-event video-text retrieval and grounding. The method leverages both inter-video and intra-video event relationships to enhance retrieval and grounding performance. At the retrieval level, we devise a Relational Event-Centric Video-Text Retrieval module based on the principle that comprehensive textual information leads to precise correspondence between text and video. It incorporates event relationship features at different hierarchical levels and exploits the hierarchical structure of video relationships to perform multi-level contrastive learning between events and videos. This enriches event descriptions and makes them more accurate and comprehensive, improving the alignment precision between text and video and enabling effective differentiation among videos. For event grounding, we propose Event Contrast-Driven Video Grounding, which accounts for positional differences among events on a 2D temporal score map and achieves precise grounding of multiple events through divergence learning over their locations. Our solution not only provides efficient text-to-video retrieval but also accurately grounds events within the retrieved videos, addressing the shortcomings of existing methods. Extensive experiments on the ActivityNet Captions and Charades-STA benchmarks demonstrate the effectiveness and superior performance of our method. The main contribution of this work is a new joint framework for video-text retrieval and multi-event grounding, which also offers new directions for research and applications in related fields. The code is available at https://github.com/X7J92/MVT-RG.
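To make the two loss ideas mentioned in the abstract more concrete, the following is a minimal sketch, not the authors' implementation: it assumes an InfoNCE-style contrastive loss for aligning video and text embeddings, and a pairwise symmetric KL term as one possible reading of "divergence learning" over event locations on a 2D temporal score map. All function names, tensor shapes, and the specific loss formulations are illustrative assumptions; see the released code for the actual method.

```python
# Sketch only: contrastive video-text alignment plus a divergence term that
# pushes apart the predicted temporal locations of different events.
import torch
import torch.nn.functional as F


def contrastive_loss(video_emb: torch.Tensor, text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of (video, text) pairs.

    video_emb, text_emb: (B, D) L2-normalised embeddings; matching rows are
    treated as positives, all other rows as negatives (an assumed setup).
    """
    logits = video_emb @ text_emb.t() / temperature            # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def location_divergence_loss(score_maps: torch.Tensor) -> torch.Tensor:
    """Encourage different events of one video to occupy different regions
    of the 2D temporal score map.

    score_maps: (E, T, T) unnormalised scores, one T x T start/end map per
    event. Each map is flattened to a distribution and pairwise symmetric KL
    is maximised (returned with a negative sign), so event locations diverge.
    """
    num_events = score_maps.size(0)
    log_dists = F.log_softmax(score_maps.view(num_events, -1), dim=-1)
    loss = score_maps.new_zeros(())
    pairs = 0
    for i in range(num_events):
        for j in range(i + 1, num_events):
            kl_ij = F.kl_div(log_dists[i], log_dists[j], log_target=True, reduction="sum")
            kl_ji = F.kl_div(log_dists[j], log_dists[i], log_target=True, reduction="sum")
            loss = loss - 0.5 * (kl_ij + kl_ji)                 # maximise divergence
            pairs += 1
    return loss / max(pairs, 1)


if __name__ == "__main__":
    # Toy usage with random tensors; weights and shapes are arbitrary.
    B, D, E, T = 4, 256, 3, 16
    v = F.normalize(torch.randn(B, D), dim=-1)
    t = F.normalize(torch.randn(B, D), dim=-1)
    maps = torch.randn(E, T, T)
    total = contrastive_loss(v, t) + 0.1 * location_divergence_loss(maps)
    print(total.item())
```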