Provably Fast Density-Based Clustering in High Dimensions

ICLR 2026 Conference Submission14008 Authors

18 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0

Keywords: DBSCAN, locality-sensitive hashing, clustering

Abstract: DBSCAN is a celebrated algorithm for density-based clustering, but its quadratic runtime hinders scalability to large datasets. In recent years, there has been considerable interest in accelerating DBSCAN. However, existing methods either impose additional structure on the data (e.g., low dimensionality) or lack rigorous runtime and approximation guarantees. Building on recent work by Okkels et al. (2025), we propose an LSH-based algorithm that achieves the first provably subquadratic runtime for approximate DBSCAN on arbitrary high-dimensional datasets. Empirically, our algorithm delivers a significant speedup over standard DBSCAN on a variety of benchmarks while incurring only a small error. We also show that our approach naturally yields a subquadratic-time approximation of HDBSCAN (a popular hierarchical variant). Complementing our algorithms, we prove quadratic-time lower bounds for exact DBSCAN and HDBSCAN, showing that subquadratic runtimes are possible only with approximation.

Supplementary Material: zip

Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning

Submission Number: 14008
