The SMOTE Paradox: Why a 92% Baseline Collapsed to 6%—A Systematic Review of 821 Papers in Imbalanced Learning (2020–2025)

TMLR Paper6827 Authors

06 Jan 2026 (modified: 03 Feb 2026) · Under review for TMLR · CC BY 4.0
Abstract: Class imbalance pervades production systems—fraud detection, medical diagnosis, industrial monitoring—yet handling it effectively remains challenging. For two decades, SMOTE has been the default solution, but practitioners increasingly abandon it at scale. We investigate this disconnect through a systematic review of 821 DBLP papers (2020–2025) and bibliometric analysis of 4,985 Scopus records. Our analysis reveals the SMOTE Paradox: while 24% of papers mention SMOTE in titles or abstracts, only 6% of scale-focused, high-impact papers successfully executed SMOTE at full dataset scale due to memory exhaustion or preprocessing bottlenecks. The field has fragmented, with 30% adopting generative models, 30% using cost-sensitive losses, and 40% employing hybrid approaches. Three factors explain SMOTE's decline. First, O(N·N_min·d) nearest-neighbor search requires 1.28 TB of memory for a representative modern dataset. Second, linear interpolation produces off-manifold artifacts scaling as √d in high dimensions. Third, CPU-bound preprocessing creates friction with GPU-centric training pipelines. We validate these findings through controlled experiments on seven tabular benchmark datasets with tree-based classifiers (196 trials, imbalance ratios 1.1:1 to 129:1). Statistical testing reveals no significant ROC-AUC differences between SMOTE and cost-sensitive baselines (Friedman p = 0.907), despite SMOTE incurring 2.7× computational overhead. However, cost-sensitive methods severely degrade at extreme imbalance (> 40:1), while SMOTE maintains performance where computationally feasible. Taken together, our bibliometric, theoretical, and empirical results provide a three-way triangulation of SMOTE's decline in contemporary imbalanced learning.
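The abstract's terabyte-scale figure follows directly from the O(N·N_min·d) memory term. A minimal back-of-envelope sketch, using hypothetical dataset sizes chosen only for illustration (N, N_min, and d here are assumptions, not the paper's actual benchmark):

```python
# Back-of-envelope memory estimate for SMOTE's O(N * N_min * d)
# nearest-neighbor search. All sizes below are HYPOTHETICAL,
# picked for illustration rather than taken from the paper.
N = 10_000_000        # total samples (assumed)
N_min = 1_000         # minority-class samples (assumed)
d = 16                # feature dimension (assumed)
bytes_per_float = 8   # float64

# Dense distance workspace: every minority point paired with every
# sample, one d-dimensional difference vector per pair.
workspace_bytes = N * N_min * d * bytes_per_float
workspace_tb = workspace_bytes / 1e12
print(f"{workspace_tb:.2f} TB")  # 1.28 TB under these assumed sizes
```

Even modest-looking inputs blow past workstation RAM, which is why chunked or approximate nearest-neighbor search (or skipping resampling entirely) becomes attractive at scale.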
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission:
Major changes:
- Added AUPRC-based empirical validation (Section 6)
- Updated abstract and contributions (Section 1.6)
- Added 2 new figures (Critical Difference diagram, AUPRC heatmap)
- Added Table 7 (Wilcoxon pairwise tests)
- Updated all statistical results (Friedman p = 0.120)
Major changes since last submission:
- 6% figure clarified: We now distinguish 24% SMOTE mentions in the full 821-paper corpus from the 6% execution rate in a deliberately scale-focused top-50 subset, and we explain this sampling design and its robustness (Sections 2.2, 3.1).
- Claims now fully supported: All unsourced practitioner/Kaggle statements were removed or weakened; geometric and generative claims now have explicit citations/derivations, and we acknowledge SMOTE-GPU and its limited practical impact.
- Experiments and No Resampling: We explicitly frame the benchmark as tabular + tree-based, and we discuss why No Resampling ≈ SMOTE at moderate imbalance while resampling dominates for IR > 40:1.
- Hardware realism: We stand by our claim of a 20 GB workstation as realistic hardware, supported by the Steam 2026 survey (16–32 GB RAM covers ~78% of users), and contrast it with 64–256 GB server configurations in Section 4.1.
Assigned Action Editor: ~Ju_Sun1
Submission Number: 6827