Revisiting DP-Means: Fast Scalable Algorithms via Parallelism and Delayed Cluster Creation

Published: 20 May 2022, Last Modified: 05 May 2023
UAI 2022 Poster
Readers: Everyone
Keywords: clustering, unsupervised, DP-Means, scalable, fast, Bayesian, nonparametric, an unknown number of clusters
Abstract: DP-means, a nonparametric generalization of K-means, extends the latter to the case where the number of clusters is unknown. Unlike K-means, however, DP-means is hard to parallelize, a limitation hindering its usage in large-scale tasks. This work bridges this practicality gap by rendering the DP-means approach a viable, fast, and highly scalable solution. First, we study the strengths and weaknesses of previous attempts to parallelize the DP-means algorithm. Next, we propose a new parallel algorithm, called PDC-DP-Means (Parallel Delayed Cluster DP-Means), based in part on delayed creation of clusters. Compared with DP-means, PDC-DP-Means provides not only a major speedup but also performance gains. Finally, we propose two extensions of PDC-DP-Means. The first combines it with an existing method, leading to further speedups. The second extends PDC-DP-Means to a Mini-Batch setting (with optional support for an online mode), allowing for another major speedup. We verify the utility of the proposed methods on multiple datasets. We also show that the proposed methods outperform other nonparametric methods (e.g., DBSCAN). Our highly efficient code can be used to reproduce our experiments and is available at https://github.com/BGU-CS-VIL/pdc-dp-means.
Supplementary Material: zip
TL;DR: Very fast and scalable algorithms for DP-Means (a K-Means-like algorithm that does not require knowing K)
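
For readers unfamiliar with the baseline, the following is a minimal sketch of the classic serial DP-means procedure as it is commonly described. It is not the PDC-DP-Means algorithm proposed in the paper, and it is not taken from the linked repository. The names dp_means, lam, and max_iter are illustrative assumptions, and lam is assumed to be a threshold on squared Euclidean distance. The sketch also shows why the method is hard to parallelize: a cluster created for one point changes the assignment of every point visited after it.

```python
import numpy as np

def dp_means(X, lam, max_iter=100):
    """Serial DP-means sketch (illustrative, not the paper's PDC-DP-Means).

    lam is an assumed penalty/threshold on squared Euclidean distance.
    """
    X = np.asarray(X, dtype=float)
    centers = [X.mean(axis=0)]            # start with one cluster at the global mean
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        changed = False
        for i, x in enumerate(X):
            # squared distance from x to every current center
            d2 = np.sum((np.asarray(centers) - x) ** 2, axis=1)
            k = int(np.argmin(d2))
            if d2[k] > lam:
                # too far from all centers: open a new cluster immediately
                # (this sequential dependency is what blocks naive parallelism)
                centers.append(x.copy())
                k = len(centers) - 1
            if labels[i] != k:
                labels[i] = k
                changed = True
        # recompute centers as cluster means, dropping empty clusters
        used = np.unique(labels)
        centers = [X[labels == k].mean(axis=0) for k in used]
        labels = np.searchsorted(used, labels)  # relabel to 0..K-1
        if not changed:
            break
    return np.asarray(centers), labels

# Toy data: two well-separated groups (hypothetical example values)
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, labels = dp_means(X, lam=9.0)
```

With a very large lam no point ever opens a new cluster, so the result collapses to a single cluster at the data mean; smaller values of lam yield more clusters.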