An Investigation of SMOTE Based Methods for Imbalanced Datasets with Data Complexity Analysis (Extended Abstract)

Nur Athirah Azhar, Muhammad Syafiq Mohd Pozi, Aniza Mohamed Din, Adam Jatowt

Published: 2024, Last Modified: 05 Feb 2025, ICDE 2024, CC BY-SA 4.0

Abstract: This extended abstract highlights the challenges that imbalanced datasets pose in real-world applications, where issues such as noise, class overlap, and small subsets of data reduce classification accuracy. While the Synthetic Minority Oversampling Technique (SMOTE) addresses class imbalance by increasing the number of minority class examples, it struggles to handle these data complexities and may even worsen them. As a result, several SMOTE variants have emerged that aim to improve its effectiveness by integrating it with other methods or altering its approach. This paper offers a comparative analysis of these variants, examining how each tackles specific data complexities. Experiments on 24 imbalanced datasets demonstrate the changes that these SMOTE variants produce, measured by F1-Score and data complexity metrics.
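To make the oversampling step concrete, here is a minimal sketch of the interpolation idea behind basic SMOTE. It is not the authors' implementation or any of the variants they compare. The function name `smote_sketch`, its parameters, and the toy data are all hypothetical and chosen only for illustration. The sketch also hints at why noise or class overlap can be amplified: every synthetic point is built from existing minority points, so a mislabeled or overlapping minority point produces more points in the same region.

```python
import math
import random


def smote_sketch(minority, n_synthetic, k=5, seed=0):
    """Illustrative SMOTE-style oversampling (hypothetical helper, not a library API).

    Each synthetic point lies on the line segment between a randomly chosen
    minority point and one of its k nearest minority neighbors.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority points by Euclidean distance to the base point.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: math.dist(base, p))
        neighbor = rng.choice(others[:k])
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Toy data (assumed for illustration): 4 minority points against 10 majority points.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
majority_count = 10
new_points = smote_sketch(minority, n_synthetic=majority_count - len(minority), k=3)
balanced_minority = minority + new_points
```

After this call, `balanced_minority` has as many points as the assumed majority class. Because the synthetic points are convex combinations of existing minority points, they never leave the region those points already span, which is one reason variants combine SMOTE with cleaning or change how base points and neighbors are selected.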