Abstract: Video moment retrieval (VMR) aims to localize a segment in an untrimmed video that is semantically relevant to a language query. The challenge of this task lies in effectively aligning the intricate and information-dense video modality with the succinctly summarized textual modality, and further localizing the starting and ending timestamps of the target moment. Previous works have attempted multi-granularity alignment of video and query in a coarse-to-fine manner, yet these efforts still fall short in addressing the inherent disparities in representation and information density between videos and queries, leading to modal misalignment. In this paper, we propose a progressive video moment retrieval framework that first retrieves the video clips most relevant and most irrelevant to the query as semantic guidance, thereby bridging the semantic gap between the video and language modalities. Furthermore, we introduce a pseudo-clip-guided aggregation module that draws densely relevant moment clips closer together, and propose a discriminative boundary-enhanced decoder, guided by the pseudo clips, that pushes semantically confusing proposals apart. Extensive experiments on the Charades-STA, ActivityNet Captions, and TACoS datasets demonstrate that our method outperforms existing methods.
External IDs: dblp:journals/tmm/LiuZSYMZ25