Tri-training and Data Editing Based Semi-supervised Clustering Algorithm

Chao Deng, Maozu Guo

Published: 2006, Last Modified: 19 May 2025, MICAI 2006, CC BY-SA 4.0

Abstract: Seed-based semi-supervised clustering algorithms often use a seed set, consisting of a small amount of labeled data, to initialize the cluster centroids and thereby improve clustering performance over the whole data set. Research indicates that both the size and the quality of the seed set strongly limit the performance of semi-supervised clustering. This paper proposes a novel semi-supervised clustering algorithm named DE-Tri-training semi-supervised K-means. In the new algorithm, before the cluster centroids are initialized, the training process of a semi-supervised classification approach named Tri-training is used to label unlabeled data and add them to the initial seeds, enlarging the seed set. Meanwhile, to improve the quality of the enlarged seed set, a Nearest Neighbor Rule based data editing technique named Depuration is introduced into the Tri-training process to remove noisy examples and correct mislabeled ones among the enlarged seeds. Experiments show that the new algorithm effectively improves the initialization of the cluster centroids and enhances clustering performance.
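The two steps around the seed set can be illustrated with a minimal sketch; it is not the authors' implementation. It assumes an enlarged seed set is already available and does not reproduce the Tri-training labeling step. The function names `depuration` and `seeded_kmeans`, the parameters `k`, `k_prime`, and `iters`, and the toy data are all hypothetical choices made for illustration. The editing rule used here is a common formulation of Depuration: a seed takes the label that at least `k_prime` of its `k` nearest neighbours share, and is discarded if no label reaches that threshold.

```python
import math
from collections import Counter

def depuration(points, labels, k=3, k_prime=2):
    """Nearest-neighbour editing (assumed rule): relabel a seed to the class
    held by at least k_prime of its k nearest neighbours, else drop it."""
    kept_points, kept_labels = [], []
    for i, p in enumerate(points):
        # k nearest neighbours of p among the other seeds (original labels)
        neighbours = sorted((j for j in range(len(points)) if j != i),
                            key=lambda j: math.dist(p, points[j]))[:k]
        votes = Counter(labels[j] for j in neighbours)
        if not votes:
            continue  # no neighbours available, so the seed cannot be checked
        label, count = votes.most_common(1)[0]
        if count >= k_prime:
            kept_points.append(p)
            kept_labels.append(label)
    return kept_points, kept_labels

def mean(rows):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(col) / len(rows) for col in zip(*rows))

def seeded_kmeans(data, seed_points, seed_labels, iters=20):
    """K-means whose initial centroids are the per-class means of the seeds."""
    classes = sorted(set(seed_labels))
    centroids = [mean([p for p, l in zip(seed_points, seed_labels) if l == c])
                 for c in classes]
    assign = []
    for _ in range(iters):
        new_assign = [min(range(len(centroids)),
                          key=lambda c: math.dist(x, centroids[c]))
                      for x in data]
        if new_assign == assign:
            break  # assignments are stable, so the centroids will not move
        assign = new_assign
        for c in range(len(centroids)):
            members = [x for x, a in zip(data, assign) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean(members)
    return [classes[a] for a in assign], centroids

# Toy seed set: two groups, with the last seed deliberately mislabeled as "b".
seeds = [(0, 0), (0, 1), (1, 0), (1, 1),
         (10, 10), (10, 11), (11, 10), (11, 11), (0.5, 0.5)]
seed_labels = ["a", "a", "a", "a", "b", "b", "b", "b", "b"]
data = [(0, 0), (1, 1), (0.2, 0.8), (10, 10), (11, 11), (10.4, 10.9)]

clean_points, clean_labels = depuration(seeds, seed_labels)
clusters, centroids = seeded_kmeans(data, clean_points, clean_labels)
```

In this toy run the mislabeled seed at (0.5, 0.5) is surrounded by "a" seeds and is relabeled before the centroids are computed, so the initial centroid for "b" is not pulled toward the other group. The sketch uses only the Python standard library and requires Python 3.8 or later for `math.dist`.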