Keywords: Active Constrained Clustering, Probabilistic Clustering
Abstract: Active Constrained Clustering (ACC) is a widely used semi-supervised clustering framework that improves clustering quality through progressive annotation of informative pairwise constraints. However, applying existing ACC methods to large datasets with numerous classes incurs high computational or query costs.
In this paper, we conduct a theoretical analysis of the inefficiency of sample-based ACC and the rationale behind cluster-based ACC. Moreover, we provide a theoretical guarantee for cluster fusion under a purity constraint and a clustering quality constraint with respect to normalized mutual information (NMI).
Drawing on these theoretical insights, we introduce a novel Active Probabilistic Clustering (APC) framework designed to scale effectively to large datasets. Compared to previous methods, APC demonstrates superior performance across eight datasets of varying sizes (ranging from 350 to 100,000 samples) in terms of clustering quality, query cost, and computational expense. Specifically, APC achieves satisfactory clustering quality (e.g., NMI $>0.95$) using 3,920 queries on a dataset with 100,000 samples, whereas baseline methods yield inferior results (e.g., NMI $\leq0.85$) with 10,000 queries. Moreover, APC runs 100x faster than baseline methods.
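For readers unfamiliar with the evaluation metric cited above, the following is a minimal, illustrative sketch (not the authors' code) of how clustering quality can be scored with normalized mutual information (NMI); the toy data and the KMeans clusterer are placeholders, not elements of the APC method.

```python
# Illustrative sketch: scoring a clustering against ground-truth labels with NMI.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import normalized_mutual_info_score

# Toy data standing in for one of the benchmark datasets (hypothetical sizes).
X, y_true = make_blobs(n_samples=1000, centers=10, random_state=0)

# Any clustering method could be scored this way; KMeans is only a placeholder.
y_pred = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

nmi = normalized_mutual_info_score(y_true, y_pred)
print(f"NMI = {nmi:.3f}")  # values near 1.0 indicate close agreement with ground truth
```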
Primary Area: general machine learning (i.e., none of the above)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 2685