Explicit Granularity and Implicit Scale Correspondence Learning for Point-Supervised Video Moment Localization

Published: 01 Jan 2024, Last Modified: 01 Aug 2025, Venue: ACM Multimedia 2024, License: CC BY-SA 4.0
Abstract: Video moment localization (VML) aims to identify the temporal boundaries of the video segment that semantically matches a given query. Point-supervised VML balances localization accuracy against annotation cost, but it remains immature because of granularity alignment and scale perception issues. To this end, we propose a Semantic Granularity and Scale Correspondence Integration (SG-SCI) framework that leverages limited single-frame annotations for correspondence learning. It explicitly models semantic relations among features of different granularities and adaptively mines the implicit semantic scale, thereby enhancing feature representations across granularities and scales. SG-SCI uses granularity correspondence alignment to align semantics via latent prior knowledge, and scale correspondence learning to identify and address semantic scale differences. Extensive experiments on benchmark datasets demonstrate the promising performance of our model compared with several state-of-the-art competitors.