Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding

Published: 01 Jan 2023, Last Modified: 14 Apr 2024, CVPR 2023, Readers: Everyone
Abstract: Spatio-Temporal Video Grounding (STVG) aims to localize the target object spatially and temporally according to a given language query. It is a challenging task in which the model must understand both the dynamic visual cues (e.g., motions) and the static visual cues (e.g., object appearances) referred to in the language description, which requires effective joint modeling of spatio-temporal visual-linguistic dependencies. In this work, we propose a novel framework in which a static vision-language stream and a dynamic vision-language stream are developed to collaboratively reason about the target tube. The static stream performs cross-modal understanding in a single frame and learns to attend to the target object spatially according to intra-frame visual cues such as object appearances. The dynamic stream models visual-linguistic dependencies across multiple consecutive frames to capture dynamic cues such as motions. We further design a novel cross-stream collaborative block between the two streams, which enables the static and dynamic streams to exchange useful and complementary information with each other to achieve collaborative reasoning. Experimental results show the effectiveness of the collaboration between the two streams, and our overall framework achieves new state-of-the-art performance on both the HCSTVG and VidSTG datasets.