Do we need rebalancing strategies? A theoretical and empirical study around SMOTE and its variants
TL;DR: A theoretical formalization of SMOTE's properties, illustrated with experiments on synthetic and real-world data sets.
Abstract: Synthetic Minority Oversampling Technique (SMOTE) is a common rebalancing strategy for handling imbalanced tabular data sets. However, few works analyze SMOTE theoretically. In this paper, we derive several non-asymptotic upper bounds on the SMOTE density. From these results, we prove that SMOTE (with its default parameter) asymptotically tends to copy the original minority samples. We empirically confirm and illustrate this first theoretical behavior on a real-world data set.
Furthermore, we prove that the SMOTE density vanishes near the boundary of the support of the minority class distribution. We then adapt SMOTE, based on our theoretical findings, to introduce two new variants.
These strategies are compared on $13$ tabular data sets with $10$ state-of-the-art rebalancing procedures, including deep generative and diffusion models. First, for most data sets, applying no rebalancing strategy is competitive in terms of predictive performance, whether with LightGBM, tuned random forests, or logistic regression. Second, when the imbalance ratio is artificially increased, one of our two modifications of SMOTE leads to promising predictive performance compared to SMOTE and the other strategies.
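For context, the core SMOTE step analyzed in the abstract is linear interpolation between a minority sample and one of its nearest minority neighbors. Below is a minimal NumPy sketch of that step (function name `smote_sample` and its signature are illustrative, not the paper's or imbalanced-learn's API; the paper's two variants are not reproduced here):

```python
import numpy as np

def smote_sample(X_min, k=5, n_new=1, rng=None):
    """Minimal sketch of plain SMOTE interpolation.

    Each synthetic point is x + u * (x_nn - x), where x is a random
    minority sample, x_nn one of its k nearest minority neighbors,
    and u ~ Uniform(0, 1).
    """
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances among minority samples
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # exclude each point as its own neighbor
    # indices of the k nearest minority neighbors of each sample
    nn = np.argsort(d, axis=1)[:, :k]
    new = []
    for _ in range(n_new):
        i = rng.integers(n)              # random minority sample
        j = nn[i, rng.integers(k)]       # one of its k neighbors
        u = rng.random()                 # interpolation weight
        new.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.asarray(new)
```

Because every synthetic point lies on a segment between two minority samples, all generated points stay inside the convex hull of the minority class, which is consistent with the abstract's observation that the SMOTE density vanishes near the boundary of the minority support.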
Submission Number: 883