Open-Vocabulary Saliency-Guided Progressive Refinement Network for Unsupervised Video Object Segmentation
Abstract: The leading unsupervised video object segmentation (UVOS) paradigm typically adopts a dual-stream architecture with motion and appearance branches, where only the motion cues from optical flow are used to locate the primary foreground objects. Under challenging conditions such as static scenes, rapid camera shake, and severe motion blur, the estimated optical flow is noisy and of low quality, leading to erroneous estimation of the primary foreground objects. To address this issue, we propose an open-vocabulary saliency-guided progressive refinement network for UVOS, dubbed OVSNet. We observe that most primary foreground objects also exhibit saliency characteristics in the appearance branch. Based on this, OVSNet complements motion cues with saliency cues predicted by a series of foundation models with strong zero-shot generalization capabilities. Specifically, we first leverage the off-the-shelf Contrastive Language-Image Pre-training (CLIP) model and CLIPSeg to generate an open-vocabulary saliency (OVS) attention map as the saliency cues. Then, the saliency cues together with the motion cues prompt the Segment Anything Model (SAM) to generate a location map. In this localization process, we design two lightweight adapters to fine-tune SAM so that it adapts well to the downstream UVOS task. Finally, the location map generated by SAM progressively guides object representation refinement in the appearance branch, ultimately yielding accurate segmentation mask predictions. Extensive evaluations on DAVIS-16, FBMS, and YouTube-Objects demonstrate the favorable performance of OVSNet over state-of-the-art methods.
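The pipeline described in the abstract can be illustrated with a rough sketch. The following is a minimal, hypothetical example (not the authors' implementation) of the first two steps: generating an open-vocabulary saliency map with CLIPSeg and using its peak as a point prompt for SAM to obtain a coarse location map. It assumes the Hugging Face CLIPSeg checkpoint "CIDAS/clipseg-rd64-refined", the official segment_anything package with a downloaded ViT-B checkpoint, and a hypothetical frame path and text prompt; the paper's motion cues, lightweight adapters, and progressive refinement in the appearance branch are not shown.

```python
# Sketch of the saliency-cue and SAM-prompting steps under the assumptions above.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPSegProcessor, CLIPSegForImageSegmentation
from segment_anything import sam_model_registry, SamPredictor


def saliency_map_clipseg(image: Image.Image, prompts: list[str]) -> np.ndarray:
    """Return an HxW saliency map in [0, 1] by max-fusing CLIPSeg predictions over prompts."""
    processor = CLIPSegProcessor.from_pretrained("CIDAS/clipseg-rd64-refined")
    model = CLIPSegForImageSegmentation.from_pretrained("CIDAS/clipseg-rd64-refined")
    inputs = processor(text=prompts, images=[image] * len(prompts),
                       padding=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits            # (num_prompts, 352, 352)
    if logits.ndim == 2:                           # single-prompt outputs may be squeezed
        logits = logits[None]
    probs = torch.sigmoid(logits).amax(dim=0)      # fuse prompts by per-pixel max
    probs = torch.nn.functional.interpolate(probs[None, None], size=image.size[::-1],
                                            mode="bilinear", align_corners=False)
    return probs[0, 0].numpy()


def sam_location_map(image: Image.Image, saliency: np.ndarray,
                     checkpoint: str = "sam_vit_b_01ec64.pth") -> np.ndarray:
    """Prompt SAM with the most salient pixel and return a binary location map."""
    sam = sam_model_registry["vit_b"](checkpoint=checkpoint)
    predictor = SamPredictor(sam)
    predictor.set_image(np.array(image.convert("RGB")))
    y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
    masks, _, _ = predictor.predict(point_coords=np.array([[x, y]]),
                                    point_labels=np.array([1]),
                                    multimask_output=False)
    return masks[0]


if __name__ == "__main__":
    frame = Image.open("frame_000.jpg")            # hypothetical video frame
    saliency = saliency_map_clipseg(frame, ["a salient foreground object"])
    location = sam_location_map(frame, saliency)
    print(f"location map covers {location.mean() * 100:.1f}% of the frame")
```

In this sketch the CLIPSeg map plays the role of the saliency cues and the SAM output plays the role of the location map; in the paper these are further combined with optical-flow motion cues and refined by the appearance branch.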
External IDs: dblp:conf/icassp/HanHS025