Abstract: Current video object segmentation methods rely heavily on pixel-level mask annotations during training, which are expensive and time-consuming to acquire. To address this problem, some approaches train with sparse scribble annotations and use a sparse target scribble as the initial information for inference. However, due to the sparsity of scribble annotations, performance is often limited, and a specially designed loss function is required. Inspired by the powerful ability of the Segment Anything Model (SAM) to leverage prompts for segmentation, we argue that this problem can be alleviated by improving the quality of the scribbles. We therefore propose SEVOS, a framework for scribble-supervised video object segmentation, which consists of a scribble enhancement algorithm and a semi-supervised video object segmentation network. Specifically, the scribble enhancement algorithm first samples positive and negative points from the target scribbles, and then feeds them into SAM in turn, achieving high-quality scribble enhancement without human intervention. This algorithm augments the scribble-annotated video dataset, which is then used for additional training of the model. Furthermore, we design a post-processing enhancement algorithm to further improve the prediction results. The resulting model outperforms state-of-the-art methods by a considerable margin, demonstrating the generalization and effectiveness of the proposed approach.
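The point-sampling step described above can be illustrated with a minimal sketch. The function name, the uniform sampling strategy, and the point counts are assumptions for illustration, not the paper's exact algorithm; the output follows SAM's point-prompt convention of (x, y) coordinates with label 1 for foreground and 0 for background.

```python
import numpy as np

def sample_prompt_points(scribble_mask, n_pos=5, n_neg=5, seed=0):
    """Sample positive points on the target scribble and negative points
    off it, formatted as (coords, labels) point prompts for SAM.

    scribble_mask: 2D boolean array, True where the scribble is drawn.
    NOTE: the sampling strategy here is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    ys, xs = np.nonzero(scribble_mask)             # pixels on the scribble
    bg_ys, bg_xs = np.nonzero(~scribble_mask)      # pixels off the scribble

    pos_idx = rng.choice(len(ys), size=min(n_pos, len(ys)), replace=False)
    neg_idx = rng.choice(len(bg_ys), size=min(n_neg, len(bg_ys)), replace=False)

    # SAM expects point_coords as (x, y) pairs and point_labels 1 (fg) / 0 (bg)
    coords = np.concatenate([
        np.stack([xs[pos_idx], ys[pos_idx]], axis=1),
        np.stack([bg_xs[neg_idx], bg_ys[neg_idx]], axis=1),
    ])
    labels = np.concatenate([np.ones(len(pos_idx), dtype=int),
                             np.zeros(len(neg_idx), dtype=int)])
    return coords, labels
```

In a full pipeline, `coords` and `labels` would be passed to a SAM predictor (e.g. as `point_coords` and `point_labels`) to produce the enhanced, dense mask that replaces the sparse scribble during training.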