Learning Semantic-Enhanced Dual Temporal Adjacent Maps for Video Moment Retrieval

25 Sept 2024 (modified: 05 Feb 2025) · Submitted to ICLR 2025 · CC BY 4.0
Keywords: Computer Vision, Multi-modal Understanding, Video Grounding
TL;DR: A novel method, semantic-enhanced Dual Temporal Adjacent Maps (DTAM), for effective video moment retrieval; it models temporal dependencies between moments in an appearance-semantic decoupled fashion.
Abstract:

Retrieving a specific moment from an untrimmed video via a text description is a central problem in vision-language learning. The task is challenging due to the sophisticated temporal dependencies among moments. Existing methods handle this issue poorly because they establish temporal relations between moments in a way that couples visual content and semantics. This paper studies a temporal dependence scheme that decouples content and semantic information, establishing semantic-enhanced Dual Temporal Adjacent Maps for video moment retrieval, referred to as DTAM. Specifically, DTAM designs two branches that encode visual appearance and semantic knowledge from video clips respectively, where knowledge from the appearance branch is distilled into the semantic branch through a well-designed semantic-aware contrastive loss, helping DTAM distinguish features that share visual content but differ in semantics. Besides, we develop a moment-aware mechanism to assist the learning of temporal adjacent maps for better video grounding. Finally, extensive experimental results and analysis demonstrate the superiority of the proposed DTAM over existing state-of-the-art approaches on three challenging video moment retrieval benchmarks, i.e., TACoS, Charades-STA, and ActivityNet Captions.
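For intuition, below is a minimal, hypothetical PyTorch sketch of the two ingredients the abstract names: a 2D temporal adjacent map, where cell (i, j) holds a pooled feature for the candidate moment spanning clips i through j, and an InfoNCE-style semantic-aware contrastive loss that pulls a moment feature toward its paired query and away from negatives (e.g., moments with similar appearance but different semantics). All function names, shapes, and the mean-pooling choice are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only; not the authors' code. Assumes per-clip features
# are already extracted and queries/negatives are embedded in the same space.
import torch
import torch.nn.functional as F


def temporal_adjacent_map(clip_feats: torch.Tensor) -> torch.Tensor:
    """clip_feats: (N, D) per-clip features -> (N, N, D) moment map.

    Cell (i, j) with i <= j holds the mean-pooled feature of clips i..j;
    cells with i > j are invalid moments and stay zero.
    """
    n, d = clip_feats.shape
    # Prefix sums let each moment's mean be computed in O(1).
    cumsum = torch.cat([clip_feats.new_zeros(1, d), clip_feats.cumsum(0)], dim=0)
    tam = clip_feats.new_zeros(n, n, d)
    for i in range(n):
        for j in range(i, n):
            tam[i, j] = (cumsum[j + 1] - cumsum[i]) / (j - i + 1)
    return tam


def semantic_contrastive_loss(moment: torch.Tensor,
                              query: torch.Tensor,
                              negatives: torch.Tensor,
                              tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: the moment feature (D,) should match its paired
    query (D,) rather than any of the K negatives (K, D)."""
    moment = F.normalize(moment, dim=-1)
    cands = F.normalize(torch.cat([query.unsqueeze(0), negatives], dim=0), dim=-1)
    logits = cands @ moment / tau                       # (K + 1,) similarities
    target = torch.zeros(1, dtype=torch.long)           # positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)


# Toy usage: 8 clips with 16-dim features, 4 hard negatives.
clips = torch.randn(8, 16)
tam = temporal_adjacent_map(clips)                      # (8, 8, 16)
loss = semantic_contrastive_loss(tam[2, 5], torch.randn(16), torch.randn(4, 16))
```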

Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 4047