Keywords: representation learning, unsupervised segmentation, whole object segmentation, spatial understanding
Abstract: We study the problem of unsupervised object segmentation, with the aim of discovering whole objects, including both distinctive and less salient parts, rather than just visually striking fragments. Existing unsupervised methods often identify only distinctive parts (e.g., head but not torso), resulting in incomplete objects. Our key insight is that whole objects can emerge from the interplay of similarity among parts and contrast with surrounding context, both within and across images. This contrastive and contextual grouping process enables the discovery of heterogeneous object parts as unified wholes, without any predefined notion of object structure.
To this end, we propose Contrastive Contextual Grouping (CCG), a three-step framework for unsupervised whole object segmentation: 1) identifying semantically similar yet visually diverse image pairs, 2) performing co-segmentation using joint graph cuts with pairwise attraction and repulsion, and 3) distilling the results into a single-image segmentation model.
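The grouping idea in step 2 can be illustrated with a minimal toy sketch. The snippet below is not the paper's implementation: it assumes pixels are represented by feature vectors, stacks the features of an image pair so both images are cut jointly, and realizes "pairwise attraction and repulsion" as a signed affinity matrix whose balanced two-way cut is recovered from the smallest eigenvector of the signed Laplacian. All names, thresholds, and the spectral relaxation are illustrative assumptions.

```python
import numpy as np

def signed_affinity(feats, sigma=0.5, repel_thresh=1.0, repel_weight=0.5):
    """Attraction between similar feature pairs, repulsion between contrasting ones.

    Illustrative stand-in for the paper's pairwise attraction/repulsion terms.
    """
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    W = np.exp(-(d / sigma) ** 2)       # attraction in (0, 1] for similar pairs
    W[d > repel_thresh] = -repel_weight  # strong contrast becomes repulsion
    return W

def signed_spectral_cut(W):
    """Two-way cut of a signed graph via its signed Laplacian.

    For a structurally balanced signed graph (positive edges within groups,
    negative edges across), the smallest eigenvector of L = D - W, with
    D = diag(sum |W|), is a per-group sign vector giving the cut.
    """
    D = np.diag(np.abs(W).sum(axis=1))
    L = D - W
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return (vecs[:, 0] > 0).astype(int)

# Toy "image pair": two images, each with object-like and background-like
# pixels. Stacking features from both images couples their segmentations,
# so the object is cut out jointly across the pair.
rng = np.random.default_rng(0)
obj_a = rng.normal(loc=0.0, scale=0.05, size=(5, 2))  # object pixels, image A
bg_a  = rng.normal(loc=3.0, scale=0.05, size=(5, 2))  # background,   image A
obj_b = rng.normal(loc=0.0, scale=0.05, size=(5, 2))  # similar object, image B
bg_b  = rng.normal(loc=3.0, scale=0.05, size=(5, 2))  # background,   image B
feats = np.vstack([obj_a, bg_a, obj_b, bg_b])

labels = signed_spectral_cut(signed_affinity(feats))
# Object pixels in both images land on one side of the cut, background on the other.
```

The joint cut is the point of the sketch: because features from both images share one affinity matrix, an object part that is ambiguous in one image can be grouped correctly through its attraction to the matching part in the other image.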
Our approach achieves state-of-the-art results across four benchmarks: unsupervised saliency detection, unsupervised object discovery, unsupervised video object segmentation, and unsupervised nuclei segmentation. Remarkably, in some settings it even matches or exceeds SAM2, a supervised foundation model, at whole object segmentation from box prompts.
Submission Type: Long Research Paper (< 9 Pages)
Submission Number: 17