A cluster-based oversampling algorithm combining SMOTE and k-means for imbalanced medical data

Zhaozhao Xu; Derong Shen; Tiezheng Nie; Yue Kou; Nan Yin; Xi Han

Published: 01 Jan 2021, Last Modified: 17 Apr 2025, Inf. Sci. 2021, CC BY-SA 4.0

Abstract: The C4.5 decision tree algorithm offers high classification accuracy, fast computation, and comprehensible classification rules, so it is widely used for medical data analysis. However, on imbalanced medical data, the classification accuracy of decision-tree-based models is not ideal. Therefore, this paper proposes a cluster-based oversampling algorithm (KNSMOTE) that combines the synthetic minority oversampling technique (SMOTE) with the k-means algorithm. The classes assigned by k-means clustering are compared with the original sample classes to select the "safe samples", whose classes are not changed by clustering. New samples are then synthesized by linear interpolation between the "safe samples". The improved SMOTE sets the oversampling ratio according to the imbalance ratio of the original samples and uses it to synthesize as many samples as there are original samples. Compared with other oversampling algorithms on 8 UCI datasets, our algorithm achieved significant advantages. When our algorithm was applied to the medical datasets, the average Sensitivity and Specificity of the Random forest (RF) algorithm were 99.84% and 99.56%, respectively.
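
The pipeline described in the abstract (cluster, keep the "safe samples", interpolate) can be illustrated with a minimal Python sketch. This is not the authors' implementation, and several details are assumptions made for illustration only: a sample is treated as "safe" when its original class equals the majority class of its k-means cluster; the number of synthetic samples is chosen to balance two classes; k-means uses a deterministic farthest-point initialization so the example is reproducible. The function names `kmeans`, `safe_samples`, and `oversample` are hypothetical.

```python
import math
import random
from collections import Counter


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns one cluster index per point."""
    # Assumption: deterministic farthest-point initialization.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by Euclidean distance.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def safe_samples(points, classes, k=2):
    """Assumed rule: keep samples whose class is their cluster's majority class."""
    labels = kmeans(points, k)
    majority = {
        c: Counter(y for y, l in zip(classes, labels) if l == c).most_common(1)[0][0]
        for c in set(labels)
    }
    return [(p, y) for p, y, l in zip(points, classes, labels) if majority[l] == y]


def oversample(points, classes, minority, k_neighbors=3, n_clusters=2, seed=0):
    """Synthesize minority samples by interpolating between safe minority samples."""
    rng = random.Random(seed)
    safe = [p for p, y in safe_samples(points, classes, n_clusters) if y == minority]
    n_min = sum(1 for y in classes if y == minority)
    # Assumption: synthesize enough samples to balance a two-class problem.
    n_new = max((len(classes) - n_min) - n_min, 0)
    synthetic = []
    if len(safe) < 2:
        return synthetic  # interpolation needs at least two safe samples
    for _ in range(n_new):
        base = rng.choice(safe)
        # Nearest safe minority neighbours of the chosen base sample.
        others = [q for q in safe if q is not base]
        neighbours = sorted(others, key=lambda q: math.dist(base, q))[:k_neighbors]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic
```

Called as `oversample(points, classes, minority=1)` with `points` given as a list of numeric tuples, the function returns only the synthetic samples; a minority-labelled point that falls inside a majority-dominated cluster is excluded from interpolation under the assumed "safe" rule.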
