Keywords: Self-supervised learning, Label propagation, Video object segmentation
TL;DR: We propose a new self-supervised learning method to learn dense features for label propagation tasks.
Abstract: Self-supervised learning (SSL) aims to learn robust and transferable representations purely from unlabeled data, which is especially useful when annotated data is scarce. Over the past decade, SSL has advanced significantly through paradigms such as Masked Image Modeling (MIM) and self-distillation. More recently, several methods have been designed for specific downstream tasks. In particular, SiamMAE introduced siamese masked auto-encoding for label propagation, where dense semantic labels from initial video frames are propagated to subsequent ones through inter-frame correspondence. CropMAE later showed that still images can achieve similar results by extracting two related crops (with random flipping to simulate a viewpoint change between the two images) and reconstructing one from the other. While both methods are effective, they rely on reconstructing the raw pixel values of masked patches, which cannot capture high-level semantics and is less robust than latent or semantic reconstruction. Building on insights from iBOT and DINOv2, we propose Crop-CoRe, an SSL method that extends CropMAE by reconstructing cluster assignments instead. In our experiments, Crop-CoRe consistently outperforms SiamMAE and CropMAE on label propagation benchmarks and achieves results competitive with state-of-the-art methods while requiring fewer training iterations. Moreover, it avoids reliance on video datasets or frame extraction, making it more resource-efficient. The code will be publicly released after publication.
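To make the two ideas in the abstract concrete, here is a minimal NumPy sketch of (a) the CropMAE-style augmentation — two related crops of a still image, one randomly flipped to mimic a viewpoint change — and (b) soft cluster assignments over a prototype codebook, the iBOT/DINOv2-style target that Crop-CoRe proposes to reconstruct instead of raw pixels. All names, shapes, the crop size, and the temperature are illustrative assumptions, not the paper's actual implementation; the patch features and prototypes are random stand-ins for a real encoder and learned codebook.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_related_crops(img, crop=8):
    """CropMAE-style pair: two random crops of one still image, the second
    randomly flipped to simulate a viewpoint change (crop size is illustrative)."""
    h, w = img.shape[:2]
    def rand_crop():
        y = rng.integers(0, h - crop + 1)
        x = rng.integers(0, w - crop + 1)
        return img[y:y + crop, x:x + crop]
    a, b = rand_crop(), rand_crop()
    if rng.random() < 0.5:
        b = b[:, ::-1]  # horizontal flip
    return a, b

def cluster_assignments(patch_feats, prototypes, tau=0.1):
    """Soft assignments of patch features to a prototype codebook
    (cosine similarity + temperature softmax), the kind of semantic
    target reconstructed in place of pixel values."""
    f = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    logits = f @ p.T / tau
    logits -= logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

img = rng.random((32, 32, 3))           # toy image
a, b = two_related_crops(img)
feats = rng.random((16, 64))            # stand-in: 16 patch features of one crop
protos = rng.random((128, 64))          # stand-in: codebook of 128 prototypes
q = cluster_assignments(feats, protos)  # (16, 128) rows summing to 1
```

A training step would then predict the assignments `q` of the masked patches in one crop from the visible patches of the other, rather than regressing pixels.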
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 3464