Abstract: Referring video object segmentation (RVOS) is a cross-modal task that aims to segment the target object described by a language expression. A video typically consists of multiple frames, and existing works conduct segmentation at either the clip level or the frame level. Clip-level methods process an entire clip at once and segment all frames in parallel, lacking explicit inter-frame interactions. In contrast, frame-level methods process videos frame by frame, enabling direct interactions between frames, but they are prone to error accumulation. In this paper, we propose a novel tracking-forced framework that introduces high-quality tracking information and forces the model to achieve accurate segmentation. Concretely, during training we use the ground-truth segmentation of previous frames as accurate inter-frame interactions, providing high-quality tracking references for object segmentation in the next frame. This decouples the current input from the previous output, allowing our model to concentrate on segmenting accurately given the tracking information, which improves training efficiency and prevents error accumulation. At inference, where ground-truth masks are unavailable, we carefully select the beginning frame used to construct tracking information, aiming to ensure accurate tracking-based frame-by-frame object segmentation. With these designs, our tracking-forced method outperforms existing methods on four widely used benchmarks by at least 3%. In particular, it achieves 88.3% P@0.5 accuracy and an 87.6 overall IoU score on the JHMDB-Sentences dataset, surpassing the previous best methods by 5.0% and 8.0, respectively.
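To make the training/inference asymmetry concrete, below is a minimal sketch of the tracking-forced loop described in the abstract. It assumes a hypothetical per-frame segmenter that takes the current frame, text features, and the previous frame's mask as a tracking reference; the names `TrackingForcedRVOS` and `segmenter` are illustrative, not the authors' actual interface.

```python
import torch
import torch.nn as nn

class TrackingForcedRVOS(nn.Module):
    """Sketch of tracking-forced training vs. frame-by-frame inference.

    `segmenter` is a hypothetical module mapping
    (frame, text_feat, prev_mask) -> predicted mask.
    """

    def __init__(self, segmenter: nn.Module):
        super().__init__()
        self.segmenter = segmenter

    def forward(self, frames, text_feat, gt_masks=None):
        # frames: (T, C, H, W); gt_masks: (T, 1, H, W) at training, None at inference.
        preds = []
        prev_mask = torch.zeros_like(frames[0, :1])  # empty reference for the first frame
        for t in range(frames.size(0)):
            pred = self.segmenter(frames[t], text_feat, prev_mask)
            preds.append(pred)
            if gt_masks is not None:
                # Training: force the tracking reference to the ground-truth mask,
                # decoupling the current input from the previous (possibly wrong) output.
                prev_mask = gt_masks[t]
            else:
                # Inference: no ground truth is available, so propagate the
                # model's own prediction as the tracking reference.
                prev_mask = pred.detach()
        return torch.stack(preds)
```

At inference the abstract additionally selects a reliable beginning frame before running this loop; that selection step is omitted here for brevity.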
Primary Subject Area: [Content] Vision and Language
Relevance To Conference: Our work significantly improves the performance of referring video object segmentation, a cross-modal vision-and-language task.
Supplementary Material: zip
Submission Number: 483