A Cluster Validity Protocol to Evaluate Internal Indices for Clustering of High-Dimensional Datasets
Abstract: Clustering high-dimensional data is a well-known challenge for data scientists, especially with distance-based methods, whose performance often degrades as dimensionality increases. Because many dimensions may be irrelevant and mask existing clusters in noisy data, Subspace Clustering is a practical approach to finding clusters within different subspaces of dimensions. In particular, Soft Subspace Clustering (SSC) assigns a weight to each feature in each cluster without transforming or reducing the feature space. However, SSC performance is usually evaluated with external measures, which require the label of each clustered object. In another direction, several internal Fuzzy Cluster Validity Indices (FCVIs) have been proposed to evaluate the performance of the well-known soft clustering algorithm Fuzzy c-Means (FCM). Aiming to improve the clustering of high-dimensional data, we investigated whether SSC performs better than FCM given appropriate parameters and evaluation indices. The assumptions made to build our research resulted in a novel protocol for evaluating SSC algorithms using FCVIs. To support firm conclusions about our proposal, we compared the performance of four clustering algorithms over thirty-seven high-dimensional datasets, validated by nineteen FCVIs.
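To make the idea of an internal index concrete, the following is a minimal sketch of standard Fuzzy c-Means followed by the classic partition coefficient, which scores a fuzzy partition using only the membership matrix and no labels. It is an illustration under stated assumptions, not the protocol proposed here: the function names `fuzzy_c_means` and `partition_coefficient`, the fuzzifier default `m=2.0`, and the synthetic data are all hypothetical choices, and the partition coefficient is not necessarily one of the nineteen FCVIs evaluated.

```python
import numpy as np


def fuzzy_c_means(X, c, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Plain Fuzzy c-Means (illustrative; names and defaults are assumptions).

    X: (n, d) data matrix, c: number of clusters, m > 1: fuzzifier.
    Returns the (n, c) membership matrix U and the (c, d) cluster centers.
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Random initial memberships, normalized so each row sums to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        Um = U ** m
        # Centers are membership-weighted means of the data.
        centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
        # Squared Euclidean distance from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.fmax(d2, 1e-12)  # guard against division by zero
        # Membership update: u_ik proportional to d_ik^(-2/(m-1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return U, centers


def partition_coefficient(U):
    """Partition coefficient: mean of squared memberships.

    Ranges from 1/c (maximally fuzzy) to 1 (crisp); higher is better.
    """
    return float((U ** 2).sum() / U.shape[0])


# Usage on hypothetical synthetic data: two well-separated blobs in 5 dimensions.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 0.1, (20, 5)), rng.normal(5.0, 0.1, (20, 5))])
U_demo, centers_demo = fuzzy_c_means(X_demo, c=2)
print(partition_coefficient(U_demo))
```

Because the index depends only on the memberships, the same function could in principle score the output of any soft clustering method, which is the kind of label-free evaluation the abstract contrasts with external measures.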