ReasonVOS: Benchmarking and Addressing Spatiotemporal-Semantic Reasoning in Instruction-Guided Video Segmentation

ACL ARR 2025 May Submission6044 Authors

20 May 2025 (modified: 03 Jul 2025) · ACL ARR 2025 May Submission · CC BY 4.0
Abstract: Existing approaches to Reasoning Video Object Segmentation (ReasonVOS) typically generate mask sequences based on implicit instructions combined with external world knowledge. However, these instructions often focus on static or isolated visual elements (e.g., “Which pants are gray?”), neglecting the spatiotemporal dynamics intrinsic to video data. In this work, we introduce DualReasonVOS, a new benchmark for ReasonVOS that combines temporal reasoning over object dynamics with semantic reasoning over implicit language, leveraging both visual context and world knowledge. To this end, we redesign the CReaVOS dataset by incorporating carefully curated implicit instructions that emphasize spatiotemporal reasoning. Furthermore, we propose the Complex Video Reasoning Segmentation framework (CVRS), which introduces an adaptive reasoning mechanism to decompose implicit instructions into hierarchical reasoning chains. This enables context-aware identification of query-relevant objects across diverse video scenarios. Experimental results demonstrate that CVRS significantly enhances both temporal and spatial reasoning capabilities, achieving superior mask quality compared to state-of-the-art methods on the CReaVOS and ReVOS benchmarks.
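The abstract does not specify how CVRS decomposes implicit instructions into hierarchical reasoning chains. The snippet below is a purely illustrative sketch of such a decomposition, not the authors' implementation; all names (`ReasoningStep`, `decompose_instruction`, the example chain) are hypothetical placeholders, and a real system would presumably rely on a learned planner or LLM rather than a hard-coded chain.

```python
# Illustrative sketch only -- not the authors' CVRS implementation.
# All class and function names here are hypothetical placeholders.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReasoningStep:
    """One node in a hierarchical reasoning chain."""
    question: str                 # sub-question derived from the instruction
    depends_on: List[int] = field(default_factory=list)  # indices of prerequisite steps


def decompose_instruction(instruction: str) -> List[ReasoningStep]:
    """Toy decomposition of an implicit instruction into ordered sub-questions.

    The chain is hard-coded to illustrate the structure only:
    a semantic step, then a temporal step, then a grounding step.
    """
    return [
        ReasoningStep(question=f"What object category does '{instruction}' refer to?"),
        ReasoningStep(question="How does that object move or change across frames?",
                      depends_on=[0]),
        ReasoningStep(question="Which mask sequence matches both the semantics and the dynamics?",
                      depends_on=[0, 1]),
    ]


if __name__ == "__main__":
    chain = decompose_instruction("the person who picks up the gray pants")
    for i, step in enumerate(chain):
        print(i, step.question, "<- depends on", step.depends_on)
```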
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: video processing, multimodality
Contribution Types: Data resources, Data analysis
Languages Studied: English
Submission Number: 6044