Abstract: Disease diagnosis from microbial data is often hindered by extreme class imbalance, which appears as both inter-class and intra-class imbalance. The former can be handled by general methods such as SMOTE, while the latter has not been well studied. In this paper, we propose an ensemble classification algorithm based on space partitioning and data augmentation (ECSD) to address both types of imbalance. First, the data are mapped into a low-dimensional space through KPCA, LMNN, and RENN, which reduces the sparsity and noise of the original dataset. Second, we design a Kannoy technique that partitions the space into subspaces and increases the distance between data points in different subspaces. This makes the data distribution more uniform and alleviates the intra-class imbalance problem. Third, a WGAN trained on the whole dataset is used to augment the data in each subspace, with different augmentation and filtering strategies employed to alleviate inter-class imbalance. Finally, base classifiers trained on each subspace are combined using a distance-weighted technique so that the ensemble provides stable predictions. We compare our algorithm with four algorithms for handling class imbalance and three algorithms for microbial-based diagnosis on 17 datasets. The results show that our algorithm outperforms its counterparts on multiple metrics, especially when the dataset imbalance ratio is high.
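To make the final step concrete, the following is a minimal sketch of one way a distance-weighted ensemble over subspaces could work. The abstract does not give the weighting formula, so everything here is an assumption for illustration: the function name `distance_weighted_vote`, the use of one centroid per subspace, and the inverse-Euclidean-distance weights are all hypothetical and are not taken from ECSD itself.

```python
import numpy as np

def distance_weighted_vote(x, centroids, probas, eps=1e-12):
    """Combine per-subspace class probabilities for a single sample.

    x         : (d,) query point in the low-dimensional space
    centroids : (k, d) one representative point per subspace (assumed)
    probas    : (k, c) class probabilities from each subspace's base classifier
    eps       : small constant that avoids division by zero
    """
    x = np.asarray(x, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    probas = np.asarray(probas, dtype=float)

    # Euclidean distance from the query point to each subspace centroid.
    dists = np.linalg.norm(centroids - x, axis=1)

    # Assumed weighting: closer subspaces get larger, normalized weights.
    weights = 1.0 / (dists + eps)
    weights /= weights.sum()

    # Weighted average of the base classifiers' probability vectors.
    return weights @ probas

# Two hypothetical subspaces at distances 1 and 3 from the query point.
demo_x = [0.0, 0.0]
demo_centroids = [[1.0, 0.0], [3.0, 0.0]]
demo_probas = [[0.9, 0.1], [0.2, 0.8]]
demo_result = distance_weighted_vote(demo_x, demo_centroids, demo_probas)
# demo_result is approximately [0.725, 0.275]
```

In this toy case the nearer subspace receives weight 0.75 and the farther one 0.25, so the classifier trained closest to the sample dominates the prediction while the others still contribute.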