Grid-based DBSCAN: Indexing and inference

Thapana Boonchoo, Xiang Ao, Yang Liu, Weizhong Zhao, Fuzhen Zhuang, Qing He

Pattern Recognit. 2019 (modified: 14 Nov 2021)

Highlights:
• The proposed method makes grid-based DBSCAN scalable to higher-dimensional datasets.
• A cluster forest is devised to alleviate redundancies in the merging step.
• A HyperGrid Bitmap is used to index non-empty grids for efficient neighbour grid queries.
• Experiments show the superior performance of the proposed method on real and synthetic data.

Abstract: DBSCAN is a clustering algorithm that can report arbitrarily shaped clusters and noise without requiring the number of clusters as a parameter (unlike other clustering algorithms such as k-means). Because the running time of DBSCAN grows quadratically, i.e. O(n^2), improving its performance has received considerable research attention for decades. Grid-based DBSCAN is a well-developed algorithm whose complexity is improved to O(n log n) in 2D space, while requiring Ω(n^(4/3)) time when the dimension is ≥ 3. However, we find that Grid-based DBSCAN suffers from two problems, neighbour explosion and redundancies in merging, which make the algorithm infeasible in high-dimensional space. In this paper we first propose a novel algorithm called GDCF, which utilizes bitmap indexing to support efficient neighbour grid queries. Second, based on the concept of the union-find algorithm, we devise a forest-like structure, called a cluster forest, to alleviate the redundancies in merging. Moreover, we find that processing the cluster forest in different orders can change the number of merging operations required in the merging step. We propose to perform the merging step in a uniform random order to optimize the number of merging operations. However, a bottleneck can occur for high-density databases, so we further propose a low-density-first order to alleviate it.
Experimental results on both real-world and synthetic datasets demonstrate that the proposed algorithm outperforms state-of-the-art exact and approximate DBSCAN algorithms and scales well.
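
The two ideas named in the abstract, a bitmap index over non-empty grids and a union-find-style forest for merging, can be illustrated with a minimal sketch. This is not the authors' GDCF implementation: the names `UnionFind`, `build_bitmap_index`, `neighbour_cells`, `merge_cells`, `radius`, and `can_merge` are hypothetical, the neighbourhood is assumed to be "coordinates differ by at most `radius` in every dimension", and the core-point and distance checks of DBSCAN are abstracted into the optional `can_merge` callback.

```python
import random


class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def build_bitmap_index(cells):
    """Index non-empty grid cells (integer coordinate tuples).

    Returns one dict per dimension mapping a coordinate value to an int
    used as a bitmap: bit i is set if cells[i] has that value there.
    """
    index = [dict() for _ in range(len(cells[0]))]
    for i, cell in enumerate(cells):
        for d, v in enumerate(cell):
            index[d][v] = index[d].get(v, 0) | (1 << i)
    return index


def neighbour_cells(index, cell, radius):
    """Return indices of non-empty cells within `radius` in every dimension."""
    result = -1  # all bits set; narrowed by each dimension's mask
    for d, v in enumerate(cell):
        mask = 0
        for u in range(v - radius, v + radius + 1):
            mask |= index[d].get(u, 0)  # OR the bitmaps inside the range
        result &= mask                  # AND across dimensions
    return [i for i in range(result.bit_length()) if (result >> i) & 1]


def merge_cells(cells, radius=1, can_merge=None, seed=0):
    """Merge neighbouring cells in a shuffled order.

    `can_merge(i, j)` stands in for the costly cell-to-cell check
    (assumption); it is skipped when both cells already share a root.
    Returns (labels, number_of_checks_performed).
    """
    index = build_bitmap_index(cells)
    forest = UnionFind(len(cells))
    order = list(range(len(cells)))
    random.Random(seed).shuffle(order)  # uniform random processing order
    checks = 0
    for i in order:
        for j in neighbour_cells(index, cells[i], radius):
            if j == i or forest.find(i) == forest.find(j):
                continue  # same tree already: no redundant check
            checks += 1
            if can_merge is None or can_merge(i, j):
                forest.union(i, j)
    labels = [forest.find(i) for i in range(len(cells))]
    return labels, checks


cells = [(0, 0), (0, 1), (1, 1), (5, 5), (6, 5), (9, 0)]
labels, checks = merge_cells(cells)
print(labels, checks)
```

In this sketch the forest lets the loop skip any pair of cells already known to belong to the same cluster, which is the kind of redundancy the abstract refers to. Replacing the shuffle with a sort by cell population would give a rough analogue of a low-density-first order, although how the paper defines that order is not stated in the abstract.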
