Keywords: Unsupervised Learning, Image Clustering
Abstract: Large-scale foundation models such as CLIP and DINOv2 provide powerful pre-trained visual embeddings that enable strong zero-shot transfer and facilitate unsupervised learning. However, for specific tasks, the visual embeddings extracted from these foundation models may still lack sufficient classification separability, making it challenging to identify a reliable classifier in the embedding space.
To address this, we propose SEAL, an unsupervised learning approach with spatial embedding and human labeling. SEAL first extracts spatial embeddings using a graph attention network to capture relational cues among image patches. These spatial embeddings are then fused with the foundation model's features via mutual distillation, producing spatially aware representations with enhanced separability. A lightweight linear classifier is subsequently trained in this embedding space to generate cluster assignments consistent with human labeling. Analysis on 26 benchmark datasets shows that incorporating spatial embeddings significantly improves triplet accuracy, confirming the enhanced separability of the foundation model embeddings. Extensive experiments further show that SEAL achieves outstanding clustering performance across all 26 datasets and remains stable across 7 foundation model backbones. The code will be released publicly.
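The abstract names the pipeline's components but not their exact form, so the following is a minimal PyTorch sketch of one plausible reading: self-attention over patch tokens stands in for the graph attention network, a symmetric KL term stands in for mutual distillation, and triplet accuracy is computed in its standard form. All module names, dimensions, and loss choices below are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SEALSketch(nn.Module):
    """Illustrative sketch of the pipeline the abstract describes; the real
    SEAL architecture may differ in every detail."""

    def __init__(self, dim: int = 768, n_clusters: int = 100):
        super().__init__()
        # Stand-in for the graph attention network: self-attention over patch
        # tokens lets each patch aggregate relational cues from the others.
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.spatial_head = nn.Linear(dim, n_clusters)    # spatial-branch logits
        self.fused_head = nn.Linear(2 * dim, n_clusters)  # lightweight linear classifier

    def forward(self, patch_tokens: torch.Tensor, global_embed: torch.Tensor):
        # patch_tokens: (B, N, dim) patch features from a frozen foundation model
        # global_embed: (B, dim) the model's global image embedding
        spatial, _ = self.spatial_attn(patch_tokens, patch_tokens, patch_tokens)
        spatial = spatial.mean(dim=1)                     # pooled spatial embedding
        fused = torch.cat([spatial, global_embed], dim=-1)
        return self.spatial_head(spatial), self.fused_head(fused)

def mutual_distillation(logits_a, logits_b, temp: float = 2.0):
    """Symmetric KL between the two branches' softened predictions; targets are
    detached so each branch distills from the other (an assumed loss form)."""
    pa = F.log_softmax(logits_a / temp, dim=-1)
    pb = F.log_softmax(logits_b / temp, dim=-1)
    return 0.5 * (F.kl_div(pa, pb.detach().exp(), reduction="batchmean")
                  + F.kl_div(pb, pa.detach().exp(), reduction="batchmean"))

def triplet_accuracy(emb, anchors, positives, negatives):
    """Fraction of triplets where the anchor lies closer to its positive than
    to its negative -- the standard reading of the metric the abstract names."""
    d_pos = (emb[anchors] - emb[positives]).norm(dim=-1)
    d_neg = (emb[anchors] - emb[negatives]).norm(dim=-1)
    return (d_pos < d_neg).float().mean().item()
```

A training step under this reading would combine a clustering objective on the fused head with `mutual_distillation` between the two branches; the abstract does not specify the clustering objective, so that term is left open here.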
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 6701