Scalable k-Means Clustering for Large k via Seeded Approximate Nearest-Neighbor Search

Published: 12 Jun 2025, Last Modified: 06 Jul 2025, Venue: VecDB 2025, License: CC BY 4.0
Keywords: k-means, clustering, ANN, high-dimensional, KNN, HNSW, nearest neighbor search
TL;DR: Faster k-means for large cluster counts
Abstract: We consider methods for fast k-means clustering of massive datasets with $10^7$ to $10^9$ points in high dimensions ($d \geq 100$) when k is very large. All current practical methods for this problem have runtimes of $\Omega(k^2)$. We find that initialization routines are not a bottleneck in this setting. Instead, it is critical to speed up Lloyd's local-search algorithm, particularly the step that reassigns each point to its closest center. Improving this step naturally leads us to approximate nearest-neighbor search methods, although this alone is not enough to be practical. We therefore introduce a family of problems we call "Seeded Approximate Nearest-Neighbor Search", for which we propose "Seeded Search-Graph" methods as a solution.
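For orientation, the reassignment step that the abstract identifies as the bottleneck can be sketched as a plain, exact Lloyd iteration. This is a minimal illustration of the standard baseline, not the paper's method: the function names (`sq_dist`, `assign_brute_force`, `update_centers`, `lloyd`) are hypothetical, and the brute-force scan stands in for whatever approximate or seeded search the authors actually use.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_brute_force(points, centers):
    """Reassignment step: scan all k centers for each of the n points.

    Cost is O(n * k * d); this is the step a nearest-neighbor index
    over the centers would aim to replace.
    """
    labels = []
    for p in points:
        best = min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
        labels.append(best)
    return labels


def update_centers(points, labels, centers):
    """Update step: move each center to the mean of its assigned points."""
    d = len(points[0])
    sums = [[0.0] * d for _ in centers]
    counts = [0] * len(centers)
    for p, j in zip(points, labels):
        counts[j] += 1
        for t in range(d):
            sums[j][t] += p[t]
    # Assumption for this sketch: an empty cluster keeps its old center.
    return [
        [s / counts[j] for s in sums[j]] if counts[j] else list(centers[j])
        for j in range(len(centers))
    ]


def lloyd(points, centers, iters=10):
    """Alternate reassignment and update until centers stop moving."""
    for _ in range(iters):
        labels = assign_brute_force(points, centers)
        new_centers = update_centers(points, labels, centers)
        if new_centers == centers:
            break
        centers = new_centers
    # Recompute labels so they match the returned centers.
    return centers, assign_brute_force(points, centers)
```

With n points and k centers, each call to `assign_brute_force` performs n * k distance computations. Since a clustering needs n ≥ k, this is at least k^2 per iteration, which is one way to read the quadratic dependence on k that the abstract mentions.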
Submission Number: 8