PMLBmini: A Tabular Classification Benchmark Suite for Data-Scarce Applications

Published: 30 Apr 2024, Last Modified: 02 Sept 2024, AutoML 2024 Workshop, CC BY 4.0
Keywords: benchmark, data scarcity, AutoML, deep learning, meta-learning
TL;DR: We introduce TabMini, a tabular classification benchmark suite for the low-data regime, and use our suite to evaluate state-of-the-art AutoML and deep learning methods against a logistic regression baseline.
Abstract: In practice, we are often faced with small-sized tabular data. However, current tabular benchmarks are not geared towards data-scarce applications, making it very difficult to derive meaningful conclusions from empirical comparisons. We introduce PMLBmini, a tabular benchmark suite of 44 binary classification datasets with sample sizes ≤ 500. We use our suite to thoroughly evaluate current automated machine learning (AutoML) frameworks, off-the-shelf tabular deep neural networks, as well as classical linear models in the low-data regime. Our analysis reveals that state-of-the-art AutoML and deep learning approaches often fail to appreciably outperform even a simple logistic regression baseline, but we also identify scenarios where AutoML and deep learning methods are indeed reasonable to apply. Our benchmark suite, available on, allows researchers and practitioners to analyze their own methods and challenge their data efficiency.
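The selection rule stated in the abstract (binary classification, sample size ≤ 500) can be sketched as a simple filter over dataset metadata. This is only an illustrative sketch: the `DatasetInfo` record, its field names, and the example entries are hypothetical and not taken from the paper, and the actual suite may apply further criteria not mentioned in the abstract.

```python
from dataclasses import dataclass


# Hypothetical metadata record; field names are illustrative, not from the paper.
@dataclass(frozen=True)
class DatasetInfo:
    name: str
    n_samples: int
    n_classes: int


MAX_SAMPLES = 500  # sample-size cap stated in the abstract


def is_low_data_binary(info: DatasetInfo, max_samples: int = MAX_SAMPLES) -> bool:
    """Return True for binary classification datasets with at most max_samples rows."""
    return info.n_classes == 2 and info.n_samples <= max_samples


def select_datasets(candidates):
    """Keep only the candidates that satisfy the low-data, binary criterion."""
    return [c for c in candidates if is_low_data_binary(c)]


# Made-up example entries, for illustration only.
candidates = [
    DatasetInfo("toy_a", 120, 2),
    DatasetInfo("toy_b", 500, 2),   # exactly at the cap, so it is kept
    DatasetInfo("toy_c", 501, 2),   # one sample over the cap
    DatasetInfo("toy_d", 300, 3),   # not binary
]
selected = select_datasets(candidates)
print([d.name for d in selected])
```

The cap is inclusive, matching the "≤ 500" wording, so a dataset with exactly 500 samples passes the filter.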
Submission Checklist: Yes
Broader Impact Statement: Yes
Paper Availability And License: Yes
Code Of Conduct: Yes
Code And Dataset Supplement: zip
Submission Number: 4