Abstract: Unsupervised semantic segmentation (USS) aims to identify semantically consistent regions and assign them correct categories without annotations. Since self-supervised pre-trained vision transformers (ViTs) provide pixel-level features rich in class-aware information and object distinctions, they have recently been widely adopted as backbones for unsupervised semantic segmentation. Although these methods achieve exceptional performance, they often rely on parametric classifiers and therefore require prior knowledge of the number of categories. In this work, we investigate adaptively clustering the current mini-batch of images without any prior on the number of categories, and propose the Adaptive Cluster Assignment Module (ACAM) to replace parametric classifiers. Furthermore, we optimize ACAM with contrastive learning to generate weights that re-weight features, thereby producing semantically consistent clusters. Additionally, we leverage the image-text pre-trained model CLIP to assign a specific label to each mask obtained from clustering and pixel assignment. Our method achieves new state-of-the-art results on the COCO-Stuff and Cityscapes datasets.
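The abstract does not specify how ACAM forms clusters, so the following is only an illustrative sketch of the general idea of clustering a mini-batch of features without fixing the number of categories in advance. It uses a simple greedy, similarity-threshold assignment; all names and the `sim_threshold` parameter are hypothetical and not the authors' method.

```python
import numpy as np

def adaptive_cluster_assignment(features, sim_threshold=0.8):
    """Greedy threshold-based clustering: each feature joins the most
    similar existing cluster if its cosine similarity to that cluster's
    centroid exceeds sim_threshold, otherwise it starts a new cluster.
    The number of clusters is therefore discovered, not preset."""
    # L2-normalize so dot products are cosine similarities.
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    centroids = []                      # running mean direction per cluster
    counts = []                         # members per cluster
    labels = np.empty(len(feats), dtype=int)
    for i, f in enumerate(feats):
        if centroids:
            sims = np.stack(centroids) @ f
            best = int(np.argmax(sims))
            if sims[best] >= sim_threshold:
                labels[i] = best
                counts[best] += 1
                # Incremental mean update, then re-normalize the centroid.
                c = centroids[best] + (f - centroids[best]) / counts[best]
                centroids[best] = c / np.linalg.norm(c)
                continue
        # No sufficiently similar cluster: open a new one.
        centroids.append(f)
        counts.append(1)
        labels[i] = len(centroids) - 1
    return labels, np.stack(centroids)
```

In practice, features from a self-supervised ViT would be used in place of the synthetic vectors, and the resulting cluster masks could then be labeled with CLIP as the abstract describes.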