More Than a Snapshot: Forcing Temporal Reasoning in Video Segmentation

ICLR 2026 Conference Submission 74 Authors

01 Sept 2025 (modified: 08 Oct 2025) · License: CC BY 4.0
Keywords: Video Reasoning Segmentation, Temporal Dynamics
Abstract: Video Reasoning Segmentation (VRS) inherits its settings from reasoning over world knowledge and spatial content, and lacks queries that demand temporal reasoning grounded in the dynamics unique to videos. To bridge this gap, we introduce TempVRS, a large-scale Temporal Video Reasoning Segmentation dataset containing 30k videos and 200k queries that inject temporal dynamics. Moreover, existing VRS methods commonly employ a three-stage paradigm: keyframe selection, reasoning, and propagation. This paradigm not only neglects the temporal dynamics inherent in videos, causing non-negligible deviations in keyframe selection, but also hinders holistic video understanding, degrading video reasoning into isolated keyframe analysis. To address these defects, we propose a temporal video reasoning segmentation method that stimulates the inherent temporal-reasoning capabilities of multi-modal large language models. By interleaving uniformly sampled video frames along the spatial dimension and explicitly injecting their spatiotemporal distribution, our 4B model achieves performance comparable to Sa2VA-8B under the same inference settings and significantly improves accuracy on existing referring/reasoning video segmentation benchmarks (e.g., $5.5\%$ and $3.4\%$ gains over Sa2VA-4B on MeViS and ReVOS, respectively).
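The abstract describes interleaving uniformly sampled frames along the spatial dimension so a single multi-modal LLM pass can attend to the whole clip. The submission does not publish implementation details, so the following is only a minimal sketch of that idea under stated assumptions: the function name, the 2×4 grid layout, and returning the sampled frame indices as a stand-in for the "spatiotemporal distribution" are all hypothetical choices, not the authors' method.

```python
import numpy as np

def interleave_frames_spatially(video, num_samples=8, grid=(2, 4)):
    """Uniformly sample frames from a video and tile them into one
    spatial grid image, so one forward pass can see the whole clip.

    video: array of shape (T, H, W, C)
    returns: (canvas, idx) where canvas has shape
             (grid[0]*H, grid[1]*W, C) and idx holds the sampled frame
             indices (a proxy for the spatiotemporal distribution that
             could be injected, e.g., as text or positional embeddings).
    """
    T, H, W, C = video.shape
    rows, cols = grid
    assert rows * cols == num_samples, "grid must hold all sampled frames"
    # Uniform temporal sampling: evenly spaced frame indices over [0, T-1].
    idx = np.linspace(0, T - 1, num_samples).round().astype(int)
    frames = video[idx]                                  # (S, H, W, C)
    # Tile row-major: (rows, cols, H, W, C) -> (rows*H, cols*W, C).
    canvas = frames.reshape(rows, cols, H, W, C)
    canvas = canvas.transpose(0, 2, 1, 3, 4).reshape(rows * H, cols * W, C)
    return canvas, idx

# Toy check on a 16-frame, 2x2-pixel "video".
video = np.arange(16 * 2 * 2 * 3, dtype=np.uint8).reshape(16, 2, 2, 3)
canvas, idx = interleave_frames_spatially(video, num_samples=8, grid=(2, 4))
```

On this toy input the canvas has shape (4, 8, 3) and `idx` is the eight evenly spaced indices [0, 2, 4, 6, 9, 11, 13, 15]; a real pipeline would feed the canvas to the vision encoder and the indices (or timestamps) to the language side.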
Primary Area: foundation or frontier models, including LLMs
Submission Number: 74