Visual-guided Query with Temporal Interaction for Video Object Segmentation

Published: 2024, Last Modified: 12 Nov 2025, ICME 2024, CC BY-SA 4.0
Abstract: The task of referring video object segmentation (RVOS) involves segmenting objects in video frames based on a given text description. However, most existing approaches treat the text directly as a query, neglecting the valuable visual and temporal information in the video. This limitation can leave the query unable to accurately perceive the target object. To address this issue, we introduce VQTI, a visual-guided query with temporal interaction approach for referring video object segmentation. Our method uses frame-level and video-level features to guide the query generation process, resulting in enhanced perception of the target object. In addition, we introduce a spectral-guided segmentation optimizer module to enhance fine-grained information, leading to more precise segmentation masks. Extensive experiments show competitive performance against state-of-the-art approaches.