Clustering by the Probability Distributions From Extreme Value Theory

Published: 01 Jan 2023, Last Modified: 13 May 2023, IEEE Trans. Artif. Intell. 2023
Abstract: Clustering is an essential task in unsupervised learning: it automatically separates instances into "coherent" subsets. As one of the best-known clustering algorithms, $k$-means assigns boundary sample points to a unique cluster without utilizing information about the sample distribution or density. By comparison, it is potentially more beneficial to consider the probability of each sample belonging to a possible cluster. To this end, this article generalizes $k$-means to model the distribution of clusters. Our novel clustering algorithm therefore models the distribution of distances to centroids over a threshold with the generalized Pareto distribution (GPD) from extreme value theory. Specifically, we propose the concept of centroid margin distance, use the GPD to establish a probability model for each cluster, and perform clustering based on the covering probability function derived from the GPD. This GPD $k$-means thus enables clustering from a probabilistic perspective. We also introduce a naive baseline, dubbed generalized extreme value (GEV) $k$-means. The GEV fits the distribution of block maxima, whereas the GPD fits the distribution of distances to the centroid exceeding a sufficiently large threshold, leading to the more stable performance of GPD $k$-means. Notably, GEV $k$-means can also estimate the cluster structure and thus performs reasonably well compared with classical $k$-means. Extensive experiments on synthetic and real datasets demonstrate that GPD $k$-means outperforms its competitors.
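To make the peaks-over-threshold idea concrete, the following is a minimal sketch (not the authors' implementation) of the GPD modeling step described above: for each cluster, distances to the centroid that exceed a high-quantile threshold are fitted with a generalized Pareto distribution via `scipy.stats.genpareto`, and points are assigned to the cluster under which their distance is least extreme. The threshold choice (80th percentile here) and the synthetic two-cluster data are illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.stats import genpareto

rng = np.random.default_rng(0)

# Two well-separated synthetic 2-D clusters with known centroids.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(6.0, 1.0, size=(200, 2))])
centroids = np.array([[0.0, 0.0], [6.0, 6.0]])

# Pairwise distances from every point to every centroid, shape (n, k).
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
assign = dists.argmin(axis=1)  # hard k-means-style assignment for fitting

# Fit a GPD to the exceedances (distance minus threshold) within each cluster.
models = []
for j in range(centroids.shape[0]):
    d = dists[assign == j, j]
    u = np.quantile(d, 0.8)            # threshold: a modeling choice
    exceed = d[d > u] - u              # peaks over threshold
    c, _, scale = genpareto.fit(exceed, floc=0.0)
    models.append((u, c, scale))

def tail_prob(d, model):
    """Probability that a within-cluster distance is at least this large:
    1 below the threshold, GPD survival function above it."""
    u, c, scale = model
    return np.where(d <= u, 1.0,
                    genpareto.sf(d - u, c, loc=0.0, scale=scale))

# Assign each point to the cluster with the largest tail probability.
probs = np.column_stack([tail_prob(dists[:, j], m)
                         for j, m in enumerate(models)])
labels = probs.argmax(axis=1)
```

On this toy data the probabilistic assignment agrees with the nearest-centroid one almost everywhere; the two differ mainly for boundary points, which is exactly where the abstract argues a distributional view helps.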