Validating Arbitrary Shaped Clusters - A Survey

Published: 01 Jan 2024, Last Modified: 01 Apr 2025, Venue: DSAA 2024, License: CC BY-SA 4.0
Abstract: Clustering is a fundamental method for advanced data analytics. In complex analysis settings, the challenge lies not only in selecting a suitable clustering method but also in choosing a resulting clustering that complies with the application requirements and the data-analytical hypotheses. While a multitude of clustering algorithms exists for computing clusters ranging from simple convex to arbitrary shaped, the question of how to quantify the quality of each individual clustering remains open. In this paper, we investigate the ability of state-of-the-art Clustering Validation Indices (CVIs) to assess clustering performance. To this end, we provide a survey of the inner workings of the different CVIs and an extensive benchmark on 180 publicly available datasets. Furthermore, we evaluate both the Euclidean distance and the density-based DC-distance to quantify the quality of arbitrary shaped clusters. Our performance evaluation indicates that no single CVI performs significantly better than the others in general, and that the density-based DC-distance is well suited for finding arbitrary shaped clusters even with CVIs not specifically designed for this task. Moreover, we found that no single CVI performs well for both arbitrary shaped and overlapping clusters at the same time. Our survey provides a comprehensive analysis of CVIs from both a theoretical and a practical point of view, and is thus a useful guideline for researchers and practitioners in academia and business.