An Empirical Study into Clustering of Unseen Datasets with Self-Supervised Encoders

TMLR Paper8890 Authors

12 May 2026 (modified: 25 May 2026) · Under review for TMLR · CC BY 4.0

Abstract: Can pretrained models generalize to new datasets without any retraining? We deploy pretrained image models on datasets they were not trained for, and investigate whether their embeddings form meaningful clusters. Our suite of benchmarking experiments uses encoders pretrained solely on ImageNet-1k with either supervised or self-supervised (SSL) training techniques, deployed on image datasets that were not seen during training, and clustered with conventional clustering algorithms. This evaluation provides new insights into the embeddings of self-supervised models, which prioritize different features from those prioritized by supervised models. Supervised encoders typically offer more utility than SSL encoders within the training domain, and vice versa far outside of it; however, fine-tuned SSL encoders demonstrate the opposite trend. Clustering provides a way to evaluate the utility of representations learnt through self-supervision that is orthogonal to existing feature quality estimation methods. Additionally, we find that the silhouette score, when measured in a UMAP-reduced space, is highly correlated with clustering performance and can therefore be used as a proxy for clustering performance on data with no ground-truth labels.
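The evaluation described in the abstract (cluster frozen embeddings, then score the clusters with and without labels) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' pipeline: the helper name `cluster_and_score` is hypothetical, K-Means stands in for "conventional clustering algorithms", adjusted mutual information stands in for "clustering performance", clustering is run in the original embedding space, and synthetic blobs replace real encoder embeddings. The `reducer` argument is any object with a `fit_transform` method; the demo passes scikit-learn's `PCA` only to stay dependency-light, and a UMAP reducer from the `umap-learn` package could be passed instead to match the UMAP-reduced setting.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_mutual_info_score, silhouette_score


def cluster_and_score(embeddings, n_clusters, reducer, true_labels=None, seed=0):
    """Cluster embeddings and score the result (hypothetical helper).

    Returns (predicted_labels, silhouette, ami). The silhouette is computed
    in the reduced space and needs no labels; ami is None when no ground
    truth is supplied.
    """
    # Assumption: cluster in the original embedding space with K-Means.
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(embeddings)

    # Label-free proxy: silhouette score measured in the reduced space.
    reduced = reducer.fit_transform(embeddings)
    sil = silhouette_score(reduced, pred)

    # Label-based score, available only when ground truth exists.
    ami = None if true_labels is None else adjusted_mutual_info_score(true_labels, pred)
    return pred, sil, ami


# Synthetic stand-in for encoder embeddings: 3 well-separated blobs in 32-D.
X, y = make_blobs(n_samples=300, centers=3, n_features=32, random_state=0)

# PCA is a stand-in reducer here; swap in a UMAP reducer to follow the abstract.
pred, sil, ami = cluster_and_score(X, n_clusters=3, reducer=PCA(n_components=2), true_labels=y)
print(f"silhouette (reduced space): {sil:.3f}, AMI vs. ground truth: {ami:.3f}")
```

On unlabeled data, only the silhouette value would be available; the abstract's claim is that it tracks the label-based score closely enough to serve as a proxy.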

Submission Type: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Adin_Ramirez_Rivera1

Submission Number: 8890
