Internal Purity: A Differential Entropy based Internal Validation Index for Clustering Validation

Published: 01 Feb 2023, Last Modified: 13 Feb 2023, Submitted to ICLR 2023, Readers: Everyone
Abstract: In cluster analysis, it is essential to validate the quality of the partitions produced by clustering. Existing internal validation indices are based on distance, variance, or model selection. Indices based on distance or variance cannot capture the real "density" of a cluster, and the time complexity of distance-based indices is usually too high for large datasets. Moreover, indices based on model selection tend to overestimate the number of clusters. We therefore propose a novel internal validation index based on differential entropy, named internal purity (IP). The proposed IP index effectively measures the purity of a cluster without using external cluster information, and overcomes the drawbacks of existing internal indices. Using six powerful pre-trained deep models, without further fine-tuning on the experimental datasets, we apply four different clustering algorithms to compare our index with thirteen other well-known internal indices on five text and five image datasets. The results show that, across 60 test cases in total, our IP index returns the optimal clustering result in 43 cases, while the second-best index reports the optimal partition in only 17 cases, which demonstrates the significant superiority of our IP index for validating the quality of clustering results. A theoretical analysis of the effectiveness and efficiency of the proposed index is also provided.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Unsupervised and Self-supervised learning