PMLBmini: A Tabular Classification Benchmark Suite for Data-Scarce Applications

Ricardo Knauer; Marvin Grimm; Erik Rodner

Published: 30 Apr 2024, Last Modified: 02 Sept 2024

Venue: AutoML 2024 Workshop

License: CC BY 4.0

Keywords: benchmark, data scarcity, AutoML, deep learning, meta-learning

TL;DR: We introduce PMLBmini, a tabular classification benchmark suite for the low-data regime, and use our suite to evaluate state-of-the-art AutoML and deep learning methods against a logistic regression baseline.

Abstract: In practice, we are often faced with small-sized tabular data. However, current tabular benchmarks are not geared towards data-scarce applications, making it very difficult to derive meaningful conclusions from empirical comparisons. We introduce PMLBmini, a tabular benchmark suite of 44 binary classification datasets with sample sizes ≤ 500. We use our suite to thoroughly evaluate current automated machine learning (AutoML) frameworks, off-the-shelf tabular deep neural networks, as well as classical linear models in the low-data regime. Our analysis reveals that state-of-the-art AutoML and deep learning approaches often fail to appreciably outperform even a simple logistic regression baseline, but we also identify scenarios where AutoML and deep learning methods are indeed reasonable to apply. Our benchmark suite, available on https://github.com/RicardoKnauer/TabMini, allows researchers and practitioners to analyze their own methods and challenge their data efficiency.
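
The inclusion criterion stated in the abstract (binary classification, sample size ≤ 500) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `DatasetInfo` record, the helper names, and the toy metadata are hypothetical and do not come from the TabMini repository, and the authors' actual selection procedure may apply further criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetInfo:
    """Hypothetical metadata record; not part of the TabMini code base."""
    name: str        # dataset identifier
    n_samples: int   # number of rows
    n_classes: int   # number of distinct target labels


MAX_SAMPLES = 500  # sample-size cap stated in the abstract


def is_low_data_binary(info: DatasetInfo, max_samples: int = MAX_SAMPLES) -> bool:
    """Return True for binary datasets with at most `max_samples` rows."""
    return info.n_classes == 2 and info.n_samples <= max_samples


def select_suite(candidates):
    """Keep the qualifying datasets, ordered by increasing sample size."""
    qualifying = (d for d in candidates if is_low_data_binary(d))
    return sorted(qualifying, key=lambda d: d.n_samples)


# Made-up metadata for illustration only; these are not real benchmark datasets.
candidates = [
    DatasetInfo("toy_a", 120, 2),  # kept: binary and small
    DatasetInfo("toy_b", 500, 2),  # kept: exactly at the cap
    DatasetInfo("toy_c", 501, 2),  # dropped: one sample over the cap
    DatasetInfo("toy_d", 300, 3),  # dropped: not binary
]

suite = select_suite(candidates)
print([d.name for d in suite])
```

Note that the cap is inclusive, matching the "≤ 500" wording: a dataset with exactly 500 samples qualifies, while one with 501 does not.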

Submission Checklist: Yes

Broader Impact Statement: Yes

Paper Availability And License: Yes

Code Of Conduct: Yes

Code And Dataset Supplement: zip

Community Implementations: https://github.com/RicardoKnauer/TabMini

Submission Number: 4