Abstract: k-means is a clustering algorithm that groups observations into clusters. As datasets grow in dimensionality, interpreting clustering results becomes increasingly difficult. In response, sparse clustering variants have emerged that assign a weight to each feature. In the sparse k-means algorithm, each feature's weight is computed from the values the observations take on that feature. However, sparse k-means is known to be sensitive to outliers. Robust sparse k-means variants have therefore been proposed, which perform sparse k-means while detecting outliers. In many real-world cases, data entry or measurement errors lead to poorly collected values for a feature, making them markedly different from the other values of that feature. Because of the dimensionality of the dataset, existing robust approaches often fail to flag the affected observations as outliers. This distorts the estimated feature weights, biases the interpretation of results, and degrades clustering quality. To fill this gap, this paper introduces a new robust sparse k-means framework consisting of a new robust initialization and a method for detecting these observations. The proposed initialization is robust in its choice of initial centers, and the proposed sparse k-means improves feature selection, interpretability and clustering quality compared to other robust variants on several real and synthetic datasets.
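To make the feature-weighting step concrete, the sketch below shows how weights are commonly computed in standard (non-robust) sparse k-means: each feature's between-cluster sum of squares is soft-thresholded and rescaled so that the weight vector has unit L2 norm and an L1 norm of at most s. The abstract does not state the update rule used in the paper, so this is an illustration under that assumption, not the authors' method. The function names, the bound `s`, and the bisection search are hypothetical choices, and the code assumes s >= 1 and a unique largest score.

```python
import numpy as np

def per_feature_bcss(X, labels):
    """Between-cluster sum of squares for each feature (column) of X."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    bcss = np.zeros(X.shape[1])
    for k in np.unique(labels):
        members = X[labels == k]
        # Cluster size times squared distance of the cluster mean from the overall mean.
        bcss += members.shape[0] * (members.mean(axis=0) - overall_mean) ** 2
    return bcss

def _normalize(v):
    """Scale v to unit L2 norm; leave an all-zero vector unchanged."""
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def sparse_weights(bcss, s, n_iter=50):
    """Nonnegative weights with ||w||_2 = 1 and ||w||_1 <= s (assumes s >= 1)."""
    if s < 1:
        raise ValueError("s must be at least 1 for the constraints to be feasible")
    bcss = np.asarray(bcss, dtype=float)
    w = _normalize(bcss)
    if w.sum() <= s:
        return w  # the L1 bound already holds, so no thresholding is needed
    # Bisection on the threshold: a larger threshold gives a sparser weight vector.
    lo, hi = 0.0, bcss.max()
    for _ in range(n_iter):
        mid = (lo + hi) / 2.0
        w = _normalize(np.maximum(bcss - mid, 0.0))
        if w.sum() > s:
            lo = mid
        else:
            hi = mid
    return _normalize(np.maximum(bcss - hi, 0.0))

# Toy data: feature 0 separates the two clusters, feature 1 does not.
X = [[0, 1], [0, 2], [10, 1], [10, 2]]
labels = [0, 0, 1, 1]
scores = per_feature_bcss(X, labels)
weights = sparse_weights(scores, s=1.2)
```

Because every weight is driven by the values recorded for that feature, a handful of erroneous entries in one feature can inflate or deflate its score, which is the failure mode the abstract describes.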
External IDs: dblp:journals/ijdsa/MadjoukengKF25