Towards Long-Form Spatio-Temporal Video Grounding

Xin Gu; Bing Fan; Jiali Yao; Zhipeng Zhang; Yan Huang; Cheng Han; Heng Fan; Libo Zhang

07 Sept 2025 (modified: 15 Nov 2025) | ICLR 2026 Conference Withdrawn Submission | CC BY 4.0

Keywords: Spatio-Temporal Video Grounding

Abstract: Videos in real-world scenarios can span several minutes or even hours, yet current research on spatio-temporal video grounding (STVG), which localizes a target in a video given a textual query, mainly focuses on videos of tens of seconds, typically less than one minute, which limits its applications. In this paper, we explore $\textbf{L}$ong-$\textbf{F}$orm $\textbf{STVG}$ ($\textbf{LF-STVG}$), which aims to locate the target in long-term videos. In LF-STVG, long-term videos encompass a much longer temporal span and more irrelevant information, making the task challenging for current short-form STVG models that process all frames at once. To address these challenges, we introduce a novel $\textbf{A}$uto$\textbf{R}$egressive $\textbf{T}$ransformer framework for LF-$\textbf{STVG}$ ($\textbf{ART-STVG}$). Unlike current STVG methods, which must see the entire sequence to make a full prediction at once, ART-STVG treats the video as a streaming input and processes its frames sequentially, making it capable of handling long videos. To capture spatio-temporal context, spatial and temporal memory banks are developed and applied to the decoders of ART-STVG. Considering that memories at different moments are not always relevant for localizing the target in the current frame, we propose simple yet effective memory selection strategies that provide more relevant information to the decoders, greatly improving performance. Moreover, rather than parallelizing spatial and temporal localization as done in existing approaches, we introduce a novel cascaded spatio-temporal design that connects the spatial decoder to the temporal decoder during grounding. In this way, ART-STVG leverages more fine-grained target information to assist with complicated temporal localization in complex long videos, further boosting performance. On the newly extended datasets for LF-STVG, ART-STVG substantially outperforms current state-of-the-art approaches, while showing competitive results on conventional short-form STVG.
Our code and models will be released.
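
To make the streaming idea in the abstract concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each frame is already encoded as a small feature vector, uses a bounded first-in-first-out memory bank, and selects memories by cosine similarity to the current frame. The names `MemoryBank`, `process_stream`, `capacity`, and `top_k` are hypothetical, the similarity-based selection is a stand-in for the paper's memory selection strategies (which the abstract does not specify), and the averaging step is a placeholder for a real decoder.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


class MemoryBank:
    """Bounded FIFO store of per-frame feature vectors (hypothetical)."""

    def __init__(self, capacity):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.items = deque(maxlen=capacity)

    def write(self, feature):
        self.items.append(feature)

    def select(self, query, top_k):
        """Return up to top_k stored features most similar to the query."""
        ranked = sorted(self.items, key=lambda m: cosine(query, m), reverse=True)
        return ranked[:top_k]


def process_stream(frame_features, capacity=4, top_k=2):
    """Process frames one at a time, reading selected memories before writing."""
    bank = MemoryBank(capacity)
    outputs = []
    for feat in frame_features:
        # Assumed selection rule: keep only the most similar past frames.
        context = bank.select(feat, top_k)
        # Placeholder for a decoder: average the frame with its selected memories.
        group = [feat] + context
        fused = [sum(vals) / len(group) for vals in zip(*group)]
        outputs.append(fused)
        bank.write(feat)
    return outputs
```

Because each frame only reads from a fixed-size bank, the per-frame cost in this sketch does not grow with video length, which is the property that motivates sequential processing for long videos.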

Primary Area: applications to computer vision, audio, language, and other modalities

Submission Number: 2705
