The SMOTE Paradox: Why a 92% Baseline Collapsed to 6%—A Systematic Review of 821 Papers in Imbalanced Learning (2020–2025)
Abstract: Class imbalance pervades production systems—fraud detection, medical diagnosis, industrial
monitoring—yet handling it effectively remains challenging. For two decades, SMOTE has
been the default solution, but practitioners increasingly abandon it at scale.
We investigate this disconnect through a systematic review of 821 DBLP papers (2020–2025)
and bibliometric analysis of 4,985 Scopus records. Our analysis reveals the SMOTE Paradox:
while 24% of papers mention SMOTE in titles or abstracts, only 6% of scale-focused, high-impact
papers successfully executed SMOTE at full dataset scale, with failures attributed to memory
exhaustion or preprocessing bottlenecks. The field has fragmented: generative models and cost-sensitive
losses each account for ∼30% of recent solutions, while alternative paradigms (including
hybrid and decoupled approaches) comprise the remainder.
Three factors explain SMOTE’s decline. First, O(N · N_min · d) nearest-neighbor search
requires 1.28 TB of memory for a representative modern dataset. Second, linear interpolation
produces off-manifold artifacts scaling as √d in high dimensions. Third, CPU-bound
preprocessing creates friction with GPU-centric training pipelines.
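The first two factors both stem from SMOTE's core step: find a minority point's nearest neighbors, then interpolate linearly toward one of them. The following is a minimal pure-Python sketch of that step, not the implementation evaluated in the paper. The function names, the toy `minority` points, and the choice of `k=2` are illustrative assumptions.

```python
import math
import random


def nearest_neighbors(points, index, k):
    """Brute-force k nearest neighbours of points[index] (Euclidean distance).

    Every other point is scanned, which is why the search cost grows with
    both the number of points and the dimensionality.
    """
    target = points[index]
    dists = [(math.dist(target, p), j) for j, p in enumerate(points) if j != index]
    dists.sort()
    return [j for _, j in dists[:k]]


def smote_like_sample(minority, k, rng):
    """Create one synthetic point on the segment between a minority point
    and one of its k nearest minority neighbours."""
    i = rng.randrange(len(minority))
    j = rng.choice(nearest_neighbors(minority, i, k))
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(minority[i], minority[j])]


# Toy minority class (hypothetical values, two features).
rng = random.Random(0)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
synthetic = [smote_like_sample(minority, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Each synthetic point lies on a straight line between two real points, so where the minority class occupies a curved region, some samples can fall outside it.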
We validate these findings through controlled experiments on seven tabular benchmark
datasets with tree-based classifiers (196 trials, imbalance ratios 1.1:1 to 129:1). We adopt
Average Precision (PR-AUC) as the primary metric and report ROC-AUC as a secondary
metric for completeness. Statistical testing reveals no significant PR-AUC differences between
SMOTE and cost-sensitive baselines (Friedman p = 0.9469), despite SMOTE incurring
2.7× computational overhead. However, cost-sensitive methods severely degrade at extreme
imbalance (> 40:1), while SMOTE maintains performance where computationally feasible.
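For readers unfamiliar with the primary metric, the sketch below shows one common way to compute Average Precision from ranked scores. It assumes binary labels, at least one positive, and no tied scores. The labels and scores are made-up values, not results from the experiments.

```python
def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive.

    Assumes labels are 0/1, at least one positive exists, and scores have no ties.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(labels)
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


# Hypothetical example: two positives among five samples.
labels = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
ap = average_precision(labels, scores)
print(round(ap, 4))  # 0.8333
```

Because every term is a precision value, the metric is driven by how highly the minority (positive) class is ranked, which is why it is often preferred to ROC-AUC under heavy imbalance.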
Taken together, our bibliometric, theoretical, and empirical results provide a three-way
triangulation of SMOTE’s decline in contemporary imbalanced learning.
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
- PR-AUC as primary metric (Sections 1.7, 6, Abstract)
- Background section added (Section 1.2)
- Section 5 reframed (Section 5, opening)
Assigned Action Editor: ~Ju_Sun1
Submission Number: 6827