Abstract: Highlights
• This study explores the potential of vision foundation models using diverse prompt strategies and proposes a mask-free approach for weakly supervised video object segmentation.
• To make prompt learning more effective in diverse and complex video scenes, we introduce a spatial–temporal decoupled deformable attention mechanism that establishes a strong correlation between intra- and inter-frame features.
• Extensive experiments on benchmark datasets show that the proposed approach, without mask supervision, outperforms existing mask-supervised methods and generalizes to weakly annotated video datasets.