Generalization Performance of Ensemble Clustering: From Theory to Algorithm

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: This paper investigates the theoretical foundations of ensemble clustering and instantiates the theory in a new ensemble clustering algorithm.
Abstract: Ensemble clustering has demonstrated great success in practice; however, its theoretical foundations remain underexplored. This paper examines the generalization performance of ensemble clustering, focusing on generalization error, excess risk, and consistency. We derive convergence rates of $\mathcal{O}(\sqrt{\frac{\log n}{m}}+\frac{1}{\sqrt{n}})$ for both the generalization error bound and the excess risk bound, where $n$ and $m$ are the numbers of samples and base clusterings, respectively. Based on this, we prove that ensemble clustering is consistent when $m$ and $n$ approach infinity and $m$ grows significantly faster than $\log n$, i.e., $m,n\to \infty,\ m\gg \log n$. Furthermore, since $n$ and $m$ are finite in practice, the generalization error cannot be reduced to zero. Thus, by assigning varying weights to the finite base clusterings, we minimize the error between the empirical average of the base clusterings and its expectation. From this, we show theoretically that to achieve better clustering performance, one should minimize the deviation (bias) of each base clustering from its expectation and maximize the differences (diversity) among the base clusterings. Additionally, we show that maximizing diversity is nearly equivalent to solving a robust (min-max) optimization model. Finally, we instantiate our theory in a new ensemble clustering algorithm. Compared with state-of-the-art (SOTA) methods, our approach achieves average improvements of 6.1\%, 7.3\%, and 6.0\% across 10 datasets w.r.t. NMI, ARI, and Purity. The code is available at https://github.com/xuz2019/GPEC.
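To make the weighted bias/diversity idea concrete, here is a minimal sketch, not the authors' GPEC implementation: the co-association representation, the `bias` and `div` proxies, and the exponentiated-gradient weight update in `weighted_ensemble` are illustrative assumptions chosen for brevity.

```python
# Hypothetical sketch of weighted ensemble clustering with a
# "minimize bias, maximize diversity" weighting scheme. Not the
# paper's GPEC algorithm; all design choices here are assumptions.
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def coassociation(labels):
    """n x n co-association matrix: 1 if two samples share a cluster."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def weighted_ensemble(base_labelings, k, n_iter=100, lr=0.5, gamma=1.0):
    # Stack the co-association matrices of the m base clusterings: (m, n, n).
    A = np.stack([coassociation(l) for l in base_labelings])
    m = A.shape[0]
    w = np.full(m, 1.0 / m)  # start from uniform weights
    for _ in range(n_iter):
        mean = np.tensordot(w, A, axes=1)  # weighted empirical-average clustering
        # Bias proxy: deviation of each base clustering from the weighted mean.
        bias = ((A - mean) ** 2).mean(axis=(1, 2))
        # Diversity proxy: how different clustering i is from all the others.
        div = np.array([((A - A[i]) ** 2).mean() for i in range(m)])
        # Exponentiated-gradient step: favour low bias and high diversity.
        w *= np.exp(-lr * (bias - gamma * div))
        w /= w.sum()
    consensus = np.tensordot(w, A, axes=1)
    # Recover k clusters from the consensus similarity matrix.
    return SpectralClustering(n_clusters=k, affinity="precomputed",
                              random_state=0).fit_predict(consensus)

# Usage: combine several perturbed k-means runs.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
base = [KMeans(n_clusters=3, n_init=1, random_state=s).fit_predict(X)
        for s in range(10)]
print(weighted_ensemble(base, k=3)[:20])
```

The exponentiated update keeps the weights on the simplex without an explicit projection; the paper's actual min-max formulation and optimizer may differ.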
Lay Summary: Ensemble clustering is a widely used technique that combines multiple clustering results to achieve higher robustness and accuracy. It has found applications in fields such as image analysis, customer segmentation, and bioinformatics. Despite its empirical success, its theoretical underpinnings remain largely unexplored. This work provides a rigorous analysis of the generalization performance of ensemble clustering. We derive bounds for the generalization error and excess risk, and characterize the asymptotic consistency of ensemble clustering. Our results demonstrate that increasing the number of samples alone is insufficient to guarantee performance gains; rather, the number and diversity of base clusterings are critical factors. Building upon this theoretical framework, we propose a novel weighted ensemble clustering algorithm that jointly minimizes bias and maximizes diversity across the base clusterings. Extensive experiments on real-world datasets confirm that our method consistently outperforms state-of-the-art techniques, with average improvements exceeding 6\%. This study not only advances the theoretical understanding of ensemble clustering but also offers practical insights into the design of more effective and principled clustering algorithms.
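The claim that more samples alone cannot guarantee gains follows directly from the rate stated in the abstract; the following restatement (assuming only that bound) makes the dependence explicit:

```latex
% With the number of base clusterings m held fixed, the first term of the
% bound grows as n increases, so the error cannot vanish; both factors
% must grow, with m outpacing log n.
\[
  \underbrace{\sqrt{\tfrac{\log n}{m}}}_{\text{grows if } m \text{ is fixed and } n\to\infty}
  \;+\;
  \underbrace{\tfrac{1}{\sqrt{n}}}_{\to\,0}
  \;\longrightarrow\; 0
  \quad\text{only if}\quad m,n\to\infty,\; m\gg\log n .
\]
```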
Link To Code: https://github.com/xuz2019/GPEC
Primary Area: General Machine Learning->Clustering
Keywords: Ensemble clustering, generalization performance, bias and diversity
Submission Number: 8598