Systematically and efficiently improving $k$-means initialization by pairwise-nearest-neighbor smoothing

Published: 07 Dec 2022 · Last Modified: 28 Feb 2023 · Accepted by TMLR
Abstract: We present a meta-method for initializing (seeding) the $k$-means clustering algorithm, called PNN-smoothing. It consists of splitting a given dataset into $J$ random subsets, clustering each of them individually, and merging the resulting clusterings with the pairwise-nearest-neighbor (PNN) method. It is a meta-method in the sense that any seeding algorithm can be used when clustering the individual subsets. If the computational complexity of that seeding algorithm is linear in the size of the data $N$ and the number of clusters $k$, then PNN-smoothing is also almost linear with an appropriate choice of $J$, and quite competitive in practice. We show empirically, using several existing seeding methods and testing on several synthetic and real datasets, that this procedure results in systematically lower costs. In particular, enhancing $k$-means++ seeding with our method proves superior in both effectiveness and speed to the popular ``greedy'' $k$-means++ variant. Our implementation is publicly available at \href{https://github.com/carlobaldassi/KMeansPNNSmoothing.jl}{https://github.com/carlobaldassi/KMeansPNNSmoothing.jl}.
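To make the abstract's pipeline concrete, here is a minimal illustrative sketch in Python (the authors' actual implementation is the Julia package linked above). It assumes scikit-learn's KMeans for the per-subset clustering, plain NumPy elsewhere, a Ward-style merge cost for the PNN step, and hypothetical function names (`pnn_merge`, `pnn_smoothing_seed`); it is not the paper's code.

```python
# Hypothetical sketch of PNN-smoothing seeding (not the authors' implementation).
import numpy as np
from sklearn.cluster import KMeans

def pnn_merge(centroids, weights, k):
    """Greedily merge weighted centroids down to k via pairwise nearest neighbors."""
    C = [np.asarray(c, dtype=float).copy() for c in centroids]
    w = [float(x) for x in weights]
    while len(C) > k:
        best, bi, bj = np.inf, -1, -1
        for i in range(len(C)):
            for j in range(i + 1, len(C)):
                # Ward-style merge cost: (w_i * w_j / (w_i + w_j)) * ||c_i - c_j||^2
                cost = w[i] * w[j] / (w[i] + w[j]) * np.sum((C[i] - C[j]) ** 2)
                if cost < best:
                    best, bi, bj = cost, i, j
        # Merge cluster bj into bi (weighted mean of centroids), then drop bj.
        C[bi] = (w[bi] * C[bi] + w[bj] * C[bj]) / (w[bi] + w[bj])
        w[bi] += w[bj]
        del C[bj], w[bj]
    return np.vstack(C)

def pnn_smoothing_seed(X, k, J, seed=0):
    """Split X into J random subsets, cluster each (here with k-means++ seeding),
    then merge the resulting J*k centroids down to k with the PNN method."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    subsets = np.array_split(idx, J)  # J should be small enough that each subset has >= k points
    centroids, weights = [], []
    for s in subsets:
        km = KMeans(n_clusters=k, init="k-means++", n_init=1).fit(X[s])
        centroids.extend(km.cluster_centers_)
        weights.extend(np.bincount(km.labels_, minlength=k))
    return pnn_merge(centroids, weights, k)

# The k returned centroids can then seed a full k-means run, e.g.:
# KMeans(n_clusters=k, init=pnn_smoothing_seed(X, k, J=10), n_init=1).fit(X)
```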
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: Implemented all the changes suggested by the Action Editor, and generally tried to highlight the significance of the results as per the discussion. More specifically:
1. Shortened the title.
2. Added a sentence to the abstract mentioning the superiority of PNNS(KM++) compared to GKM++.
3. Added Table 1, which summarizes the complexities and NDCs of all seeding methods (mentioned at the top of sec. 2).
4. Split all paragraphs in sec. 2 so as to separate the discussions on the complexity/NDCs of each method.
5. Added two paragraphs explaining the goal of the experiments at the end of sec. 4.1 (experimental setup), as suggested.
6. Added a paragraph highlighting some significant results at the end of sec. 4.2 (synthetic datasets).
7. Added a paragraph highlighting some significant results at the end of sec. 4.3 (real-world datasets).
8. Expanded the discussion to mention that a) REF(INIT) is not systematically better than INIT, contrary to PNNS(INIT), and that we were the first to test REF(INIT) with INIT != UNIF; b) PNNS(KM++) is superior to any alternative scheme that enhances KM++, i.e. GKM++, REF(KM++), REF(GKM++), and is thus a better default than GKM++.
Also: de-anonymized the paper and added a link to the GitHub repository.
Code: https://github.com/carlobaldassi/KMeansPNNSmoothing.jl
Assigned Action Editor: ~Aditya_Menon1
License: Creative Commons Attribution 4.0 International (CC BY 4.0)
Submission Number: 333