Clustering nominal data using unsupervised binary decision trees: Comparisons with the state of the art methods

Pattern Recognition, 2017 (modified: 15 Jan 2021)
Highlights
• An extension of clustering using binary decision trees (CUBT) is presented for nominal data.
• New heuristics are given for tuning the parameters of CUBT.
• CUBT outperforms many of the existing approaches for nominal datasets.
• The tree structure helps with the interpretation of the obtained clusters.
• The method is usable for direct prediction.
• The method may be used with parallel computing and is thus suited to big data.

Abstract
In this work, we propose an extension of CUBT (clustering using unsupervised binary trees) to nominal data. For this purpose, we primarily use heterogeneity criteria and dissimilarity measures based on mutual information, entropy, and the Hamming distance. We show that for this type of data, CUBT outperforms most of the existing methods. We also provide and justify guidelines and heuristics for tuning the parameters of CUBT. Extensive comparisons with other well-known approaches are made using simulations, and two applications to real datasets are given.
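As an illustration of one of the dissimilarity measures the abstract mentions, here is a minimal sketch of the Hamming distance between nominal vectors: the fraction of attributes on which two records disagree. This is a generic illustration, not the paper's exact heterogeneity criterion, and the function name and example data are hypothetical.

```python
def hamming_dissimilarity(x, y):
    """Fraction of nominal attributes on which x and y disagree.

    Illustrative sketch only; CUBT's actual splitting criteria
    also involve entropy and mutual information.
    """
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of attributes")
    return sum(a != b for a, b in zip(x, y)) / len(x)

# Two records described by three nominal attributes; they differ
# on one attribute, so the dissimilarity is 1/3.
d = hamming_dissimilarity(["red", "small", "round"],
                          ["red", "large", "round"])
print(d)
```

Because the distance is just a per-attribute disagreement count, it needs no numeric encoding of the categories, which is what makes it a natural choice for nominal data.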