PMLBmini: A Tabular Classification Benchmark Suite for Data-Scarce Applications

Published: 30 Apr 2024, Last Modified: 02 Sept 2024, Venue: AutoML 2024 Workshop, License: CC BY 4.0
Keywords: benchmark, data scarcity, AutoML, deep learning, meta-learning
TL;DR: We introduce PMLBmini, a tabular classification benchmark suite for the low-data regime, and use our suite to evaluate state-of-the-art AutoML and deep learning methods against a logistic regression baseline.
Abstract: In practice, we are often faced with small tabular datasets. However, current tabular benchmarks are not geared towards data-scarce applications, making it very difficult to derive meaningful conclusions from empirical comparisons. We introduce PMLBmini, a tabular benchmark suite of 44 binary classification datasets with sample sizes ≤ 500. We use our suite to thoroughly evaluate current automated machine learning (AutoML) frameworks, off-the-shelf tabular deep neural networks, and classical linear models in the low-data regime. Our analysis reveals that state-of-the-art AutoML and deep learning approaches often fail to appreciably outperform even a simple logistic regression baseline, but we also identify scenarios where AutoML and deep learning methods are indeed reasonable to apply. Our benchmark suite, available at https://github.com/RicardoKnauer/TabMini, allows researchers and practitioners to analyze their own methods and challenge their data efficiency.
Submission Checklist: Yes
Broader Impact Statement: Yes
Paper Availability And License: Yes
Code Of Conduct: Yes
Code And Dataset Supplement: zip
Community Implementations: https://github.com/RicardoKnauer/TabMini
Submission Number: 4
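
The abstract compares methods against a logistic regression baseline on datasets with at most 500 samples. The following is a minimal sketch of what such a baseline evaluation could look like. It does not use the benchmark's own code or API. The synthetic dataset, the helper names (`make_toy_dataset`, `fit_logistic_regression`, `cross_validated_accuracy`), the hyperparameters, the 5-fold split, and the choice of accuracy as the metric are all assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def make_toy_dataset(n_samples=200, n_features=5, seed=0):
    """Synthetic binary data; a hypothetical stand-in for one small benchmark dataset."""
    rng = random.Random(seed)
    true_w = [rng.uniform(-2, 2) for _ in range(n_features)]
    X, y = [], []
    for _ in range(n_samples):
        x = [rng.gauss(0, 1) for _ in range(n_features)]
        p = sigmoid(sum(w * v for w, v in zip(true_w, x)))
        X.append(x)
        y.append(1 if rng.random() < p else 0)
    return X, y


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_regression(X, y, lr=0.1, epochs=200, l2=0.01):
    """L2-regularized logistic regression trained with full-batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, y):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, X):
    """Predict class 1 when the linear score is non-negative."""
    return [1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0 for x in X]


def cross_validated_accuracy(X, y, k=5, seed=0):
    """Mean held-out accuracy over k shuffled folds (assumed protocol)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        w, b = fit_logistic_regression([X[i] for i in train], [y[i] for i in train])
        preds = predict(w, b, [X[i] for i in fold])
        correct = sum(int(p == y[i]) for p, i in zip(preds, fold))
        scores.append(correct / len(fold))
    return sum(scores) / k


X, y = make_toy_dataset()
assert len(X) <= 500  # stay within the low-data regime described in the abstract
baseline_accuracy = cross_validated_accuracy(X, y)
print(f"Logistic regression baseline, 5-fold accuracy: {baseline_accuracy:.3f}")
```

A candidate AutoML or deep learning method could be evaluated on the same folds, so that any difference from this baseline reflects the method rather than the split.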