What and Where: Semantic Grasping and Contextual Scanning for Moment Retrieval and Highlight Detection
Abstract: The current surge in video content highlights the tasks of moment retrieval (MR) and highlight detection (HD), which localize video segments of events and predict clip-wise saliency scores from text queries. Recent methods, while effective, may overlook two aspects: 1) multimodal features from frozen encoders are often weakly aligned, hindering thorough semantic exploration of video clips through fine-grained cross-modal interaction; 2) because adjacent video clips often lack significant distinction, clip-level context modeling struggles to accurately locate query-relevant content. To mitigate these gaps, and inspired by how humans understand visual events, we propose a progressive framework dubbed "what and where" that first grasps the aligned semantics of each video clip and then scans moment-level contextual features temporally to identify events matching the query. In the 'what' stage, to explicitly align modal features and achieve a thorough semantic understanding, we first devise an Initial Semantic Projection (ISP) loss that pulls together different modal features with similar semantics. We further develop a Clip Semantic Mining module to deeply mine the relevance of these identified semantics to the specific query at both the word and sentence level. In the 'where' stage, to enhance feature distinctiveness, we design a Multi-Context Perception module that models moment-level context; it comprises an Event Context (EC) branch and a Chronological Context (CC) branch, focusing on possible query-relevant event moments and temporal moments of various lengths, respectively. Finally, extensive experiments validate the state-of-the-art performance of our W2W model on three benchmark datasets without additional pre-training. Code is available at https://github.com/TJUMMG/W2W.
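The abstract does not give the exact formulation of the ISP loss; the minimal sketch below only illustrates the general idea of pulling together cross-modal features with similar semantics, using a symmetric InfoNCE-style contrastive objective as a stand-in. All function names, shapes, and the temperature value are assumptions, not the authors' implementation.

```python
# Illustrative cross-modal alignment loss in the spirit of the ISP objective
# described above: matched clip/text feature pairs are pulled together and
# mismatched pairs pushed apart. This is an assumed sketch, not the W2W code.
import torch
import torch.nn.functional as F


def alignment_loss(clip_feats: torch.Tensor,
                   text_feats: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """clip_feats, text_feats: (N, D) paired features from the two modalities."""
    clip_feats = F.normalize(clip_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    # (N, N) similarity matrix between every clip feature and every text feature.
    logits = clip_feats @ text_feats.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: each clip should match its own text and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    v = torch.randn(8, 256)   # e.g., projected video-clip features
    t = torch.randn(8, 256)   # e.g., projected query/text features
    print(alignment_loss(v, t).item())
```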
External IDs: dblp:journals/tcsv/LiuHNZS25