Keywords: Video-Text Retrieval; Grounding; Multi-event Queries
Abstract: Video-text retrieval aims to precisely search a video corpus for the videos most relevant to a text query. However, existing methods are largely limited to single-text (single-event) queries and are therefore ineffective at handling multi-text (multi-event) queries. Furthermore, these methods typically focus solely on retrieval and make no attempt to locate multiple events within the retrieved videos. To address these limitations, this paper proposes a novel method named Disentangling Inter- and Intra-Video Relations, which jointly considers multi-event video-text retrieval and grounding. The method leverages both inter-video and intra-video event relationships to improve retrieval and grounding performance. At the retrieval level, we devise a Relational Event-Centric Video-Text Retrieval module, based on the principle that more comprehensive textual information yields a more precise correspondence between text and video. It incorporates event-relationship features at different hierarchical levels and exploits the hierarchical structure of the corresponding video relationships to perform multi-level contrastive learning between events and videos. This enhances the richness, accuracy, and comprehensiveness of event descriptions, improving the alignment precision between text and video and enabling effective differentiation among videos. For event localization, we propose Event Contrast-Driven Video Grounding, which accounts for positional differences between events and achieves precise grounding of multiple events through divergence learning over event locations. Our solution not only provides efficient text-to-video retrieval but also accurately locates events within the retrieved videos, addressing the shortcomings of existing methods. Extensive experiments on the ActivityNet-Captions and Charades-STA benchmarks demonstrate the superior performance of our method and clearly validate its effectiveness. The main contribution of this work is a new joint framework for video-text retrieval and multi-event localization, which also offers new directions for further research and applications in related fields.
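The abstract does not specify implementation details, so the following is only a minimal, illustrative sketch of the two loss ideas it mentions: multi-level contrastive learning between event descriptions and videos, and divergence learning over event locations. The use of InfoNCE, mean-pooled query-level aggregation, a hinge-based separation of predicted event centers, and the function names (`multi_level_retrieval_loss`, `event_divergence_loss`) are all assumptions of this sketch, not the paper's actual modules.

```python
# Illustrative sketch only: generic contrastive and divergence losses assumed
# from the abstract's description; not the paper's actual DIIVR implementation.
import torch
import torch.nn.functional as F


def info_nce(query, keys, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired query/key embeddings."""
    query = F.normalize(query, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = query @ keys.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(query.size(0), device=query.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def multi_level_retrieval_loss(event_emb, video_emb):
    """Contrast text features against video features at two levels:
    individual events vs. the video, and the aggregated multi-event query
    vs. the video. Mean pooling as the aggregator is an assumption."""
    B, E, D = event_emb.shape                           # batch, events per query, dim
    event_level = info_nce(
        event_emb.reshape(B * E, D),
        video_emb.unsqueeze(1).expand(B, E, D).reshape(B * E, D))
    query_level = info_nce(event_emb.mean(dim=1), video_emb)
    return event_level + query_level


def event_divergence_loss(pred_centers, margin=0.1):
    """Push predicted temporal centers of different events in the same video
    apart via a hinge on their pairwise distance, standing in for the
    abstract's 'divergence learning over event locations'."""
    diff = (pred_centers.unsqueeze(2) - pred_centers.unsqueeze(1)).abs()  # (B, E, E)
    E = pred_centers.size(1)
    off_diag = ~torch.eye(E, dtype=torch.bool, device=pred_centers.device)
    return F.relu(margin - diff)[:, off_diag].mean()


if __name__ == "__main__":
    B, E, D = 4, 3, 256
    event_emb = torch.randn(B, E, D)                    # per-event text embeddings
    video_emb = torch.randn(B, D)                       # per-video embeddings
    centers = torch.rand(B, E)                          # normalized event centers in [0, 1]
    loss = (multi_level_retrieval_loss(event_emb, video_emb)
            + event_divergence_loss(centers))
    print(loss.item())
```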
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 8559