Abstract: Deep image clustering methods are typically evaluated on small-scale balanced
classification datasets while feature-based k-means has been applied on proprietary
billion-scale datasets. In this work, we explore the performance of feature-based
deep clustering approaches on large-scale benchmarks whilst disentangling the
impact of the following data-related factors: i) class imbalance, ii) class granularity,
iii) easy-to-recognize classes, and iv) the ability to capture multiple classes.
To this end, we develop multiple new benchmarks based on ImageNet21K. Our
experimental analysis reveals that feature-based k-means is often unfairly evaluated
on balanced datasets. However, deep clustering methods outperform k-means
across most large-scale benchmarks. Interestingly, k-means underperforms on
easy-to-classify benchmarks by large margins. The performance gap, however,
diminishes in the largest data regimes, such as ImageNet21K. Finally, we find that
non-primary cluster predictions capture meaningful classes (i.e., coarser classes).