The K-Means-Type Algorithms Versus Imbalanced Data Distributions

2012 (modified: 25 Apr 2023) | IEEE Trans. Fuzzy Syst. 2012 | Readers: Everyone
Abstract: $K$-means is a partitional clustering technique that is well known and widely used for its low computational cost. The representative algorithms include the hard $k$-means and the fuzzy $k$-means. However, the performance of these algorithms tends to be affected by skewed data distributions, i.e., imbalanced data. They often produce clusters of relatively uniform sizes, even if the input data have varied cluster sizes, which is called the “uniform effect.” In this paper, we analyze the causes of this effect and illustrate that it is likely to occur more often in the fuzzy $k$-means clustering process than in the hard $k$-means clustering process. As the fuzzy index $m$ increases, the “uniform effect” becomes evident. To counter the “uniform effect,” we propose a multicenter clustering algorithm in which multiple centers, instead of a single center, are used to represent each cluster.
The proposed algorithm consists of three subalgorithms: the fast global fuzzy $k$-means, Best M-Plot, and grouping multicenter algorithms. They are used, respectively, to address three important problems: 1) How can reliable cluster centers be obtained from a dataset? 2) How can the number of clusters that these centers represent be determined? 3) How can it be judged which cluster centers represent the same cluster? The experimental studies on both synthetic and real datasets illustrate the effectiveness of the proposed clustering algorithm in clustering balanced and imbalanced data.
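The "uniform effect" described in the abstract can be explored with a small experiment. The sketch below is a plain, textbook-style fuzzy $k$-means written with NumPy; it is not the multicenter algorithm proposed in the paper, and the function names (`make_imbalanced`, `fuzzy_k_means`), the cluster sizes, the cluster positions, and the random seeds are all assumptions chosen for illustration. It clusters one large and one small Gaussian blob for several values of the fuzzy index $m$ and reports how many points end up in each cluster after hardening the memberships.

```python
import numpy as np


def make_imbalanced(n_big=1000, n_small=50, seed=0):
    """Two 2-D Gaussian blobs of very different sizes (illustrative data)."""
    rng = np.random.default_rng(seed)
    big = rng.normal(loc=(0.0, 0.0), scale=1.0, size=(n_big, 2))
    small = rng.normal(loc=(5.0, 0.0), scale=1.0, size=(n_small, 2))
    return np.vstack([big, small])


def memberships(X, centers, m):
    """Fuzzy memberships: u_ij proportional to d_ij^(-2/(m-1))."""
    # Squared distances from every point to every center, shape (n, k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    d2 = np.maximum(d2, 1e-12)  # guard against division by zero
    inv = d2 ** (-1.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)


def fuzzy_k_means(X, k, m=2.0, n_iter=100, seed=0):
    """Standard fuzzy k-means with fuzzy index m > 1.

    Returns the final centers and the membership matrix for those centers.
    """
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        W = memberships(X, centers, m) ** m
        # Each center is the mean of all points weighted by u_ij^m.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
    return centers, memberships(X, centers, m)


if __name__ == "__main__":
    X = make_imbalanced()
    for m in (1.5, 2.0, 4.0):
        centers, U = fuzzy_k_means(X, k=2, m=m)
        # Harden the fuzzy partition and count points per cluster.
        sizes = np.bincount(U.argmax(axis=1), minlength=2)
        print(f"m={m}: cluster sizes {sizes.tolist()}")
```

Comparing the printed sizes with the true sizes (1000 and 50 under these assumed settings) gives a rough feel for how strongly the partition drifts toward equal-sized clusters as $m$ changes; the exact numbers depend on the seeds and the blob separation.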