Abstract: Unsupervised semantic segmentation algorithms aim to identify meaningful semantic groups without annotations. Recent approaches leveraging self-supervised transformers as pre-training backbones have successfully obtained high-level dense features that effectively express semantic coherence. However, these methods often overlook local semantic coherence and low-level features such as color and texture. We propose integrating low-level visual cues to complement the high-level visual cues derived from self-supervised pre-training branches. Our findings indicate that low-level visual cues capture color and texture more coherently, ensuring the continuity of spatial structures within classes. This insight led us to develop IL2Vseg, an unsupervised semantic segmentation method that leverages low-level visual cues as a complement to high-level ones. The core of IL2Vseg is a spatially-constrained fuzzy clustering algorithm based on color affinities, which preserves the intra-class affinity of spatially-adjacent and similarly-colored pixels in the low-level visual cues. Additionally, to effectively couple low-level and high-level visual cues, we introduce a feature similarity loss that optimizes the feature representation of the fused visual cues. To further enhance consistent feature learning, we incorporate contrastive losses based on color invariance and luminosity invariance, which improve feature learning across different semantic categories. Extensive experiments on multiple datasets, including COCO-Stuff-27, Cityscapes, Potsdam, and MaSTr1325, demonstrate that IL2Vseg achieves state-of-the-art results.
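The abstract does not spell out how the spatially-constrained fuzzy clustering on color affinities is computed. As a minimal illustrative sketch only, the snippet below runs fuzzy c-means over per-pixel colors and blends each membership map with its local spatial average, so spatially-adjacent, similarly-colored pixels tend to share a cluster. All names and hyperparameters (`n_clusters`, `m`, `spatial_weight`, window size) are assumptions, not the authors' actual method.

```python
# Illustrative sketch (assumed, not the paper's exact algorithm):
# fuzzy c-means on pixel colors with spatial smoothing of memberships.
import numpy as np
from scipy.ndimage import uniform_filter

def spatial_fuzzy_cmeans(image, n_clusters=8, m=2.0, spatial_weight=0.5,
                         n_iters=30, eps=1e-8):
    """image: HxWx3 float array (e.g., RGB or Lab colors in [0, 1])."""
    h, w, c = image.shape
    x = image.reshape(-1, c)                        # (N, 3) pixel colors
    rng = np.random.default_rng(0)
    u = rng.random((x.shape[0], n_clusters))
    u /= u.sum(axis=1, keepdims=True)               # initial fuzzy memberships

    for _ in range(n_iters):
        um = u ** m
        centers = (um.T @ x) / (um.sum(axis=0)[:, None] + eps)        # (K, 3)
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1) + eps
        u_new = 1.0 / (d2 ** (1.0 / (m - 1.0)))     # standard FCM update
        u_new /= u_new.sum(axis=1, keepdims=True)

        # Spatial constraint: blend each membership map with its local mean so
        # adjacent, similarly-colored pixels keep the same cluster assignment.
        u_maps = u_new.reshape(h, w, n_clusters)
        u_smooth = uniform_filter(u_maps, size=(5, 5, 1))
        u = (1 - spatial_weight) * u_maps + spatial_weight * u_smooth
        u = u.reshape(-1, n_clusters)
        u /= u.sum(axis=1, keepdims=True)

    return u.reshape(h, w, n_clusters).argmax(-1)   # hard per-pixel labels
```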