Abstract: Video Moment Retrieval (VMR) aims to temporally localize a query-specified moment in untrimmed videos. Surveillance videos are indispensable for public security, and automatic understanding of their content is crucial for enhancing existing investigative measures. Although existing VMR methods work reasonably well for conventional activity videos, they face particular challenges for videos in security domains, which are often of low quality and involve specific event-driven human behaviors. Specifically, existing methods are unable to capture precise foreground-relevant information in noisy frames, and their over-reliance on word-level textual query features fails to model hierarchical semantics, hindering alignment with complex multi-event videos. To address these challenges, we propose an Event-driven Localization with Foreground-enhanced Representation (ELFR) framework, which consists of two key components: 1) to suppress background noise and enhance the saliency of foreground elements, the Foreground-Enhanced Representation (FER) module refines cross-modal alignment using spatio-temporal cross-modal attention; 2) the Event-Driven Localization (EDL) module extracts event-level semantic units and integrates these event features with the enhanced visual representations to generate boundary predictions. Experiments conducted on the surveillance-focused UCA dataset demonstrate that our proposed method achieves state-of-the-art performance.
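To make the FER idea concrete, below is a minimal sketch of cross-modal attention for foreground enhancement: visual frame features attend to word-level query features, and the attended textual context gates the visual features so query-relevant foreground content is emphasized. All dimensions, the gating design, and the class name are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ForegroundEnhancedAttention(nn.Module):
    """Hypothetical sketch of cross-modal attention for foreground
    enhancement (in the spirit of the FER module): frames act as
    queries over word features, and the resulting text context
    re-weights the visual stream. Not the authors' implementation."""

    def __init__(self, vis_dim=512, txt_dim=300, hid_dim=256, num_heads=4):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.txt_proj = nn.Linear(txt_dim, hid_dim)
        self.attn = nn.MultiheadAttention(hid_dim, num_heads, batch_first=True)
        # Per-frame gate controlling how strongly text context is injected,
        # acting as a proxy for foreground saliency (assumed design choice).
        self.gate = nn.Sequential(nn.Linear(2 * hid_dim, hid_dim), nn.Sigmoid())

    def forward(self, vis_feats, txt_feats):
        # vis_feats: (B, T, vis_dim) frame features
        # txt_feats: (B, L, txt_dim) word-level query features
        v = self.vis_proj(vis_feats)                 # (B, T, hid_dim)
        t = self.txt_proj(txt_feats)                 # (B, L, hid_dim)
        ctx, _ = self.attn(query=v, key=t, value=t)  # text context per frame
        g = self.gate(torch.cat([v, ctx], dim=-1))   # foreground saliency gate
        return v + g * ctx                           # enhanced visual features

# Usage: enhance 64 frames with a 12-word query (random tensors for illustration).
vis = torch.randn(2, 64, 512)
txt = torch.randn(2, 12, 300)
enhanced = ForegroundEnhancedAttention()(vis, txt)   # (2, 64, 256)
```

The gated residual keeps the original visual signal intact for frames where the query is uninformative, which matches the stated goal of suppressing background noise rather than replacing the visual stream outright.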