Meta-Learned Surrogates for Clustering Model Selection

TMLR Paper8882 Authors

11 May 2026 (modified: 22 May 2026). Under review for TMLR. License: CC BY 4.0
Abstract: Clustering model selection without ground-truth labels relies on Internal Validity Measures (IVMs) such as Silhouette, Calinski-Harabasz, and Davies-Bouldin. These fixed surrogates encode particular geometric assumptions and often correlate poorly with external agreement across heterogeneous datasets. We propose MetaIVM, a meta-learned surrogate for external-agreement-based clustering model selection. Trained offline on labeled benchmarks and deployed without labels, MetaIVM predicts the quality of individual (dataset, partition) pairs from observable features of partition structure, dataset statistics, and graph topology. Unlike prior meta-learning work that recommends algorithms at the dataset level, this per-partition formulation handles algorithm choice, hyperparameter selection, and cluster-count selection in a single framework. On a benchmark of 223 datasets and 16,889 clustering runs, MetaIVM reduces selection regret by 67% over the best classical IVM. Principled controls show that neither learning from IVMs alone nor dataset-level meta-selection suffices: the per-partition formulation is essential, and even linear regression with our features outperforms all IVMs. The method adapts its feature reliance to dataset geometry, is robust across four external metrics and across modified candidate pools, and transfers from synthetic to real-world domains, though performance depends on the training distribution. As a preliminary extension to graph community detection, where coordinate-based IVMs do not apply, MetaIVM outperforms modularity-based selection.
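The selection-regret comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes regret is the gap between the best external score in a candidate pool and the external score of the partition that a surrogate ranks first; the abstract does not state the paper's exact definition, so this may differ from it. The function name `selection_regret` and all scores below are hypothetical and made up for illustration.

```python
from typing import Sequence


def selection_regret(surrogate_scores: Sequence[float],
                     external_scores: Sequence[float]) -> float:
    """Regret of selecting the candidate with the highest surrogate score.

    Assumption (not taken from the paper): regret is the best external
    score in the pool minus the external score of the selected candidate.
    Higher is assumed to be better for both score types.
    """
    if len(surrogate_scores) != len(external_scores):
        raise ValueError("score lists must have the same length")
    if len(surrogate_scores) == 0:
        raise ValueError("candidate pool must not be empty")
    # Index of the partition the surrogate would pick (no labels needed).
    chosen = max(range(len(surrogate_scores)), key=lambda i: surrogate_scores[i])
    # Labels are used only here, to evaluate the choice after the fact.
    return max(external_scores) - external_scores[chosen]


# Hypothetical pool of three candidate partitions for one dataset.
silhouette = [0.62, 0.55, 0.40]   # a fixed IVM (illustrative values)
predicted = [0.50, 0.81, 0.30]    # a learned per-partition surrogate (illustrative)
ari = [0.35, 0.90, 0.20]          # external agreement, e.g. ARI (illustrative)

regret_ivm = selection_regret(silhouette, ari)
regret_meta = selection_regret(predicted, ari)
print(f"IVM regret: {regret_ivm:.2f}, learned-surrogate regret: {regret_meta:.2f}")
```

Because the surrogate scores individual partitions rather than whole datasets, the same argmax step covers algorithm choice, hyperparameter selection, and cluster-count selection: each is just another candidate in the pool.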
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Emanuele_Sansone1
Submission Number: 8882