Explicit Granularity and Implicit Scale Correspondence Learning for Point-Supervised Video Moment Localization
Abstract: Video moment localization (VML) aims to identify the temporal boundary of the target moment that semantically matches a given query. Existing approaches fall into three paradigms: fully-supervised, weakly-supervised, and point-supervised. Compared to the other two paradigms, point-supervised VML strikes a balance between localization accuracy and annotation cost. However, it remains in its infancy due to two challenges, explicit granularity alignment and implicit scale perception, which are especially acute under complex cross-modal correspondences. To this end, we propose the Semantic Granularity and Scale Correspondence Integration (SG-SCI) framework, which models the semantic alignment between video and text by leveraging limited single-frame annotations for correspondence learning. It explicitly models semantic relations across feature granularities and adaptively mines the implicit semantic scale, thereby enhancing and exploiting modal feature representations of varying granularities and scales. Specifically, SG-SCI employs a granularity correspondence alignment module that aligns semantic information by leveraging latent prior knowledge, and a scale correspondence learning strategy that identifies and mitigates semantic scale differences. Extensive comparison experiments, ablation studies, and hyperparameter analyses on benchmark datasets demonstrate that our model outperforms several state-of-the-art competitors.
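The submission does not include code, so as a rough illustration only, the following is a minimal sketch of what explicit granularity alignment between video and query features might look like. The function name, feature shapes, and the use of cosine similarity at a fine (clip-word) and a coarse (video-sentence) granularity are our assumptions for illustration, not the authors' actual SG-SCI implementation.

```python
# Hypothetical sketch of cross-modal granularity alignment (illustrative,
# not the paper's released code): video clip features are compared with
# query features at fine (clip-word) and coarse (video-sentence) levels.
import torch
import torch.nn.functional as F

def granularity_alignment(clip_feats: torch.Tensor, word_feats: torch.Tensor):
    """clip_feats: (T, d) per-clip video features.
    word_feats: (L, d) per-word query features.
    Returns a fine-grained similarity matrix and a coarse-grained score."""
    clip_feats = F.normalize(clip_feats, dim=-1)
    word_feats = F.normalize(word_feats, dim=-1)

    # Fine granularity: clip-to-word cosine similarity matrix of shape (T, L).
    fine_sim = clip_feats @ word_feats.T

    # Coarse granularity: mean-pooled video vs. mean-pooled sentence similarity.
    video_vec = F.normalize(clip_feats.mean(dim=0), dim=-1)
    sent_vec = F.normalize(word_feats.mean(dim=0), dim=-1)
    coarse_sim = video_vec @ sent_vec

    return fine_sim, coarse_sim

# Example: 128 clips, a 12-word query, 256-dim features.
fine, coarse = granularity_alignment(torch.randn(128, 256), torch.randn(12, 256))
```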
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: Cross-modal video moment localization finds the specific moments within a video that correspond to a given textual description. It requires jointly integrating and analyzing text and video, placing it squarely within cross-modal multimedia research.
Submission Number: 1859