TL;DR: VideoSEG-O3 establishes the first multi-turn RL framework for RVOS, integrating iterative temporal-spatial CoT exploration with SEG-aware logit calibration to directly optimize latent [SEG] embeddings.
Abstract: Reasoning Video Object Segmentation (RVOS) demands a sophisticated integration of temporal dynamics, spatial details, and linguistic reasoning to achieve precise pixel-level localization. Existing methods are limited to reasoning over fixed initial inputs and lack the capacity to actively acquire further visual evidence, which is often essential for resolving complex references in long or intricate videos. To address this, we propose $\textbf{VideoSEG-O3}$, the first multi-turn reinforcement learning framework for RVOS that emulates the human $\textit{``coarse-to-fine''}$ cognitive process. It employs a $\textit{multi-turn temporal-spatial chain-of-thought}$ to capture fine-grained details by iteratively pinpointing critical intervals and keyframes. Additionally, to enable the policy to perceive segmentation quality beyond mere text probability of $\texttt{[SEG]}$ during the RL stage, we introduce $\textit{SEG-aware logit calibration}$, which integrates pixel-wise segmentation feedback directly into the token-level logits. Furthermore, we design a $\textit{decoupled thinking trace}$ to hierarchically decompose the reasoning process into temporal, spatial, and linguistic dimensions, and construct $\textbf{VTS-CoT}$, a specialized cold-start dataset featuring comprehensive reasoning trajectories. Extensive experiments demonstrate that VideoSEG-O3 achieves advanced performance across 8 mainstream RVOS benchmarks, particularly excelling in long-horizon and complex reasoning tasks.
Lay Summary: Videos often contain many moving objects, and people may refer to a target using complex descriptions involving actions, timing, or relationships. Existing video segmentation methods usually rely on a fixed set of sampled frames, so they can miss the brief but crucial visual evidence needed to identify the correct object.
We tackle this with VideoSEG-O3, a system that analyzes videos through multi-step visual exploration. It first forms a broad understanding of the video, then actively selects important time intervals and key frames to inspect more closely before producing the final object mask. We also design a training strategy that connects the model’s reasoning process with pixel-level mask quality, helping it learn both where to look and how to segment the target accurately.
This research makes language-guided video object segmentation more reliable, especially for long videos and descriptions that require reasoning about motion or events. It can support applications such as video editing, robotics, surveillance analysis, and assistive visual tools, where systems need to find the right object from natural language instructions.
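The coarse-to-fine exploration described in the lay summary can be illustrated with a toy sketch. This is not the VideoSEG-O3 implementation: the function names, the `score_frame` relevance scorer, and the "zoom into the neighbours of the best frame" rule are all hypothetical stand-ins for decisions that the actual system makes through learned multi-turn reasoning. The sketch only shows the general shape of the loop: sample sparsely, pick the most promising region, then resample more densely inside it.

```python
from typing import Callable, List, Tuple


def uniform_indices(start: int, end: int, count: int) -> List[int]:
    """Return up to `count` evenly spaced frame indices in [start, end]."""
    if count <= 1 or end <= start:
        return [start]
    step = (end - start) / (count - 1)
    # A set removes duplicates that appear once the interval gets short.
    return sorted({round(start + i * step) for i in range(count)})


def coarse_to_fine(
    num_frames: int,
    score_frame: Callable[[int], float],  # hypothetical relevance scorer
    frames_per_turn: int = 5,
    max_turns: int = 3,
) -> Tuple[Tuple[int, int], int]:
    """Narrow a video down to a critical interval and a keyframe.

    Assumption: each turn keeps the sampled neighbours of the
    best-scoring frame as the interval for the next turn.
    """
    start, end = 0, num_frames - 1
    best = start
    for _ in range(max_turns):
        sampled = uniform_indices(start, end, frames_per_turn)
        best = max(sampled, key=score_frame)
        pos = sampled.index(best)
        new_start = sampled[max(pos - 1, 0)]
        new_end = sampled[min(pos + 1, len(sampled) - 1)]
        if (new_start, new_end) == (start, end):
            break  # the interval can no longer shrink
        start, end = new_start, new_end
    return (start, end), best


if __name__ == "__main__":
    # Toy scorer: pretend the crucial evidence sits around frame 37.
    interval, keyframe = coarse_to_fine(101, lambda i: -abs(i - 37))
    print("interval:", interval, "keyframe:", keyframe)
```

With a fixed budget of five frames per turn, each turn spends that budget on a shorter interval, which is how sparse initial sampling can still reach brief evidence in a long video.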
Link To Code: https://github.com/Dmmm1997/VideoSEG-O3
Primary Area: Reinforcement Learning->Deep RL
Keywords: Reasoning Video Object Segmentation; Reinforcement Learning; MLLM
Originally Submitted PDF: pdf
Submission Number: 2377