Keywords: spatio-temporal reasoning, vision-language models, kinematic instructions
Abstract: Spatio-temporal reasoning is essential for understanding real-world environments in various fields, $\textit{e.g.}$, autonomous driving and sports analytics. While recent advances have strengthened the spatial reasoning abilities of Vision-Language Models (VLMs) through large-scale training data, these models still struggle with kinematic aspects such as the traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark for kinematic instruction tuning, referred to as $\textbf{STKit}$ and $\textbf{STKit-Bench}$. They consist of real-world videos with 3D annotations that capture object motion dynamics, including traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale data construction to videos without 3D annotations, we propose an automatic pipeline that generates pseudo-labels via 4D reconstruction at real-world scale. Building on this kinematic instruction-tuning data, we introduce $\textbf{ST-VLM}$, a VLM enhanced for spatio-temporal reasoning, which achieves strong performance on STKit-Bench. Moreover, ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on comprehensive spatio-temporal reasoning benchmarks. Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning grounded in kinematics.
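For illustration, here is a minimal sketch of how the kinematic quantities named above (traveled distance and mean speed) could be computed from per-frame 3D object positions. The function `kinematic_stats` and its inputs are hypothetical, not part of the STKit pipeline; metric-scale (real-world) trajectories are assumed.

```python
import numpy as np

def kinematic_stats(positions, fps):
    """Traveled distance (m) and mean speed (m/s) from per-frame 3D centroids.

    positions: (T, 3) array of an object's 3D location per frame,
               assumed to be in metric (real-world) scale.
    fps:       video frame rate.
    """
    positions = np.asarray(positions, dtype=float)
    # Displacement vectors between consecutive frames.
    steps = np.diff(positions, axis=0)
    # Path length: sum of Euclidean step lengths.
    traveled = np.linalg.norm(steps, axis=1).sum()
    # Elapsed time spans T - 1 frame intervals.
    duration = (len(positions) - 1) / fps
    mean_speed = traveled / duration if duration > 0 else 0.0
    return traveled, mean_speed

# Toy example: an object moving 0.5 m per frame at 10 fps -> 5 m/s.
traj = np.stack([np.array([0.5 * t, 0.0, 0.0]) for t in range(11)])
dist, speed = kinematic_stats(traj, fps=10)
print(f"traveled {dist:.1f} m at {speed:.1f} m/s")
```

Note that this measures path length rather than net displacement, so it stays meaningful for curved or back-and-forth motion.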
Supplementary Material: zip
Primary Area: foundation or frontier models, including LLMs
Submission Number: 1821