Abstract: Tabular data represent one of the most prevalent data formats in applied machine learning, largely because they accommodate a broad spectrum of real-world problems.
Existing literature has studied many of the shortcomings of neural architectures on tabular data and has repeatedly confirmed the scalability and robustness of gradient-boosted decision trees across varied datasets. However, recent deep learning models have not been subjected to a comprehensive evaluation under conditions that allow for a fair comparison with existing classical approaches. This situation motivates an investigation into whether recent deep learning paradigms outperform classical ML methods on tabular data. Our survey fills this gap by benchmarking twenty state-of-the-art methods, spanning neural networks, classical ML, and AutoML techniques. Our empirical results over 68 diverse classification datasets from a well-established benchmark indicate a paradigm shift in which deep learning methods outperform classical approaches.
Submission Type: Regular submission (no more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=GwxgT12Lte&referrer=%5BAuthor%20Console%5D(%2Fgroup%3Fid%3DTMLR%2FAuthors%23your-submissions)
Changes Since Last Submission: Since our last submission, we have made the following changes:
- Added new state-of-the-art tabular foundation models: TabPFNv2, TabICL, Mitra, and LimiX
- Added LightGBM as a strong GBDT baseline
- Added TabM, a strong MLP-based model that incorporates efficient batch ensembling
- Added ModernNCA as a representative method from the knowledge-retrieval paradigm
Overall, we increased the number of baselines from 13 to 20.
In addition, we introduced a more fine-grained pairwise (1v1) comparison by including a win-rate dueling matrix in Figure 4.
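As a rough illustration of what a pairwise win-rate matrix contains, the sketch below computes, for every ordered pair of methods, the fraction of shared datasets on which the first method scores higher than the second. The function name `win_rate_matrix`, the input layout, and the convention that a tie counts as half a win are assumptions made for this example; they are not taken from the paper.

```python
def win_rate_matrix(scores):
    """Pairwise (1v1) win rates between methods.

    scores: {dataset_name: {method_name: metric}}, where higher is better.
    Returns {(a, b): fraction of shared datasets on which a beats b}.
    Assumption for this sketch: a tie counts as half a win for each side.
    """
    methods = sorted({m for per_method in scores.values() for m in per_method})
    matrix = {}
    for a in methods:
        for b in methods:
            if a == b:
                continue
            wins, total = 0.0, 0
            for per_method in scores.values():
                # Only compare on datasets where both methods have a score.
                if a in per_method and b in per_method:
                    total += 1
                    if per_method[a] > per_method[b]:
                        wins += 1.0
                    elif per_method[a] == per_method[b]:
                        wins += 0.5
            if total:
                matrix[(a, b)] = wins / total
    return matrix


# Toy scores, invented purely to exercise the function.
toy_scores = {
    "d1": {"A": 0.90, "B": 0.80},
    "d2": {"A": 0.70, "B": 0.80},
    "d3": {"A": 0.60, "B": 0.60},
    "d4": {"A": 0.95, "B": 0.90},
}
toy_matrix = win_rate_matrix(toy_scores)
```

With this tie convention the two entries for a pair always sum to one, so the matrix can be read row by row as "how often does this method beat that one".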
A central insight of our work is that refitting improves performance and can change the overall ranking of methods relative to evaluations without refitting. To better support this point, we expanded the refitting experiments with two more baselines: TabM and XGBoost. The previous submission only examined whether refitting helps each method in isolation; we now also analyze how refitting changes the relative ranking between methods. We show that method rankings do change under refitting and that these differences are statistically significant, which further strengthens our claim that refitting is both important and underexplored in prior work.
Moreover, we complement the rank-based analysis with absolute performance differences in Appendix F.6. Specifically, we group datasets into three families based on the number of instances: small ($n \leq 1000$), medium ($1000 < n \leq 10000$), and large ($n > 10000$). For each dataset family, we report the median and mean $\Delta$ROC-AUC of each method relative to the best GBDT on that dataset. We believe this analysis is especially useful for practitioners, as it provides a more interpretable view of the practical performance gaps.
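The grouping and aggregation described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the input layout, and the set `GBDT_METHODS` (restricted here to the two GBDTs named in this change-log) are assumptions for the example.

```python
from statistics import mean, median

# Assumption: which methods count as GBDTs. Only names mentioned in the
# change-log are listed; the paper's actual set may differ.
GBDT_METHODS = {"XGBoost", "LightGBM"}


def size_family(n_instances):
    """Map a dataset size to the three families: small, medium, large."""
    if n_instances <= 1000:
        return "small"
    if n_instances <= 10000:
        return "medium"
    return "large"


def delta_auc_by_family(scores, sizes, gbdt_methods=GBDT_METHODS):
    """Median and mean delta ROC-AUC versus the best GBDT, per size family.

    scores: {dataset: {method: roc_auc}}
    sizes:  {dataset: number_of_instances}
    Returns {(family, method): {"median": ..., "mean": ...}}.
    """
    deltas = {}
    for dataset, per_method in scores.items():
        # Reference point: the best-scoring GBDT on this particular dataset.
        best_gbdt = max(auc for m, auc in per_method.items() if m in gbdt_methods)
        family = size_family(sizes[dataset])
        for method, auc in per_method.items():
            deltas.setdefault((family, method), []).append(auc - best_gbdt)
    return {
        key: {"median": median(values), "mean": mean(values)}
        for key, values in deltas.items()
    }


# Toy numbers, invented purely to exercise the functions.
toy_scores = {
    "a": {"XGBoost": 0.80, "LightGBM": 0.82, "TabM": 0.85},
    "b": {"XGBoost": 0.90, "LightGBM": 0.88, "TabM": 0.89},
}
toy_sizes = {"a": 500, "b": 800}
toy_summary = delta_auc_by_family(toy_scores, toy_sizes)
```

A positive value means the method beats the best GBDT on a typical dataset in that family; reporting both median and mean shows whether a gap is driven by a few outlier datasets.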
Assigned Action Editor: ~Philip_K._Chan1
Submission Number: 7681