STDNet: Spatio-Temporal Decomposed Network for Video Grounding

2022 (modified: 16 Nov 2022), ICME 2022
Abstract: Previous methods for video grounding treated either the query or the video as a whole, while neglecting their respective semantics in the orthogonal space and time dimensions. Since spatial semantics appears frequently in a video, temporal semantics is more discriminative and deserves more attention. Based on such considerations, we propose a novel Spatio-Temporal Decomposed Network (STDNet) which decomposes the query and the video into their spatial and temporal semantics, respectively. Specifically, spatial and temporal words are selected from the query, and the video is split into two pathways. Spatial cross-modal attention is computed first and serves as prior knowledge for temporal attention. A new localization strategy is also devised which regresses the segment's start conditioned on the end and essentially breaks the independence assumption made in previous methods. Experimental results on three public benchmark datasets show that our STDNet outperforms the state-of-the-art methods.
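To illustrate why conditioning the start on the end matters, here is a minimal toy sketch and not the STDNet implementation. The abstract describes regressing the start; this sketch instead picks boundaries by argmax over per-clip scores, which is a swapped-in simplification. All function names, score arrays, and the clip-level discretization are hypothetical and chosen only for illustration.

```python
# Toy comparison: independent boundary prediction vs. predicting the
# start conditioned on the end. Hypothetical names and scores throughout;
# this is NOT the localization head described in the paper.

def argmax(values):
    """Return the index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def localize_independent(start_scores, end_scores):
    """Pick start and end separately; nothing prevents start > end."""
    return argmax(start_scores), argmax(end_scores)


def localize_start_given_end(end_scores, start_scores_given_end):
    """Pick the end first, then pick the start among clips up to that end.

    start_scores_given_end[e][s] is an assumed score for clip s being the
    start when clip e is the end.
    """
    end = argmax(end_scores)
    # Only clips at or before the chosen end are valid start candidates.
    candidates = start_scores_given_end[end][: end + 1]
    start = argmax(candidates)
    return start, end


if __name__ == "__main__":
    start_scores = [0.1, 0.2, 0.3, 2.0]
    end_scores = [0.1, 0.2, 1.5, 0.3]
    start_scores_given_end = [
        [1.0, 0.0, 0.0, 0.0],
        [0.5, 1.0, 0.0, 0.0],
        [0.2, 1.0, 0.4, 2.0],
        [0.1, 0.3, 0.9, 0.2],
    ]
    print(localize_independent(start_scores, end_scores))
    print(localize_start_given_end(end_scores, start_scores_given_end))
```

In this toy setting the independent predictor can return a start that lies after the end, while the conditional predictor cannot, because the start is chosen only from clips at or before the selected end.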