Keywords: DBSCAN, locality-sensitive hashing, clustering
Abstract: DBSCAN is a celebrated algorithm for density-based clustering, but its quadratic runtime hinders scalability to large datasets. In recent years, there has been considerable interest in accelerating DBSCAN. However, existing methods either impose additional structure on the data (e.g., low dimensionality) or lack rigorous runtime and approximation guarantees. Building on recent work by Okkels et al. (2025), we propose an LSH-based algorithm that achieves the first provably subquadratic runtime for approximate DBSCAN on arbitrary high-dimensional datasets. Empirically, our algorithm delivers a significant speedup over standard DBSCAN on a variety of benchmarks while incurring only a small error. We also show that our approach naturally yields a subquadratic-time approximation of HDBSCAN (a popular hierarchical variant). Complementing our algorithms, we prove quadratic-time lower bounds for exact DBSCAN and HDBSCAN, showing that subquadratic runtimes are possible only with approximation.
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 14008