Sequential Clustering for Real-World Datasets

Chongwei Huang, Jian Hou, Huaqiang Yuan

Published: 2024, Last Modified: 16 Jan 2025, PRICAI (1) 2024, CC BY-SA 4.0

Abstract: This paper addresses two major problems in data clustering. First, the results of many clustering algorithms depend heavily on the specified number of clusters. Second, many algorithms that perform well on synthetic datasets produce less satisfactory results on real datasets. We propose a sequential clustering approach that tackles both problems at once. In the first step, we use the dominant set algorithm together with border-peeling and reverse nearest neighbors to obtain clusters sequentially. In the second step, we estimate the parameters of a Gaussian mixture model (GMM) from the current clustering result and then perform GMM clustering to obtain the final result, which suits the approximately Gaussian clusters found in many real-world datasets. Our algorithm can be used either as a complete clustering algorithm or as a pre-clustering step for estimating the number of clusters. Experiments on 20 real datasets demonstrate that the algorithm is effective for real-world data clustering.
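The second step described in the abstract (estimating GMM parameters from an existing clustering and then running GMM clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gmm_from_labels`, the full-covariance model, the fixed iteration count, and the regularization constant are all assumptions made for illustration, and the initial labels here stand in for whatever the first step produces. The number of components is taken from the initial clustering rather than supplied separately.

```python
import numpy as np

def gmm_from_labels(X, labels, n_iter=50, reg=1e-6):
    """Refine a hard clustering with EM for a full-covariance GMM.

    Hypothetical helper: the initial labels play the role of the
    pre-clustering result, and they fix the number of components.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    n, d = X.shape
    ids = np.unique(labels)
    k = len(ids)
    # Start from one-hot responsibilities given by the initial labels.
    resp = (labels[:, None] == ids[None, :]).astype(float)
    for _ in range(n_iter):
        # M-step: weights and means from the current responsibilities.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        log_prob = np.empty((n, k))
        for j in range(k):
            diff = X - means[j]
            # M-step (continued): regularized covariance of component j.
            cov = (resp[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)
            # E-step: log of weight times Gaussian density for component j.
            _, logdet = np.linalg.slogdet(cov)
            maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(cov), diff)
            log_prob[:, j] = np.log(weights[j]) - 0.5 * (
                d * np.log(2 * np.pi) + logdet + maha
            )
        # Normalize in log space for numerical stability.
        log_prob -= log_prob.max(axis=1, keepdims=True)
        resp = np.exp(log_prob)
        resp /= resp.sum(axis=1, keepdims=True)
    # Final hard assignment: most probable component for each point.
    return ids[resp.argmax(axis=1)], weights, means
```

Calling `gmm_from_labels(X, initial_labels)` returns refined labels together with the mixture weights and means from the last M-step. Starting EM from an existing partition, rather than from random initialization, is what lets the GMM inherit the cluster count found earlier; the covariance regularization is only a guard against degenerate components in this sketch.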