Keywords: benchmarking, tabular deep learning, dataset selection, OpenML
Abstract: Tabular data makes up a large part of real-world ML applications, and thus there has been strong interest in developing novel deep learning (DL) architectures for supervised learning on tabular data in recent years. As a result, there is an ongoing debate as to whether DL methods are superior to the ubiquitous ensembles of boosted decision trees. Very often, the advantage of one model class over the other is claimed based on an empirical evaluation in which different variants of both model classes are compared on a set of benchmark datasets that supposedly resemble relevant real-world tabular data. While the landscape of state-of-the-art models for tabular data has changed, one factor has remained largely constant over the years: the datasets. Here, we examine $30$ recent publications, covering a total of $187$ different datasets, in terms of dataset age, study size, and relevance. We find that the average study uses fewer than $10$ datasets and that $50$\% of the datasets are older than a current first-year student (born in 1994). Our insights raise questions about the conclusions drawn from previous studies and urge the research community to develop and publish recent, challenging, and relevant datasets and ML tasks for supervised learning on tabular data.
Primary Subject Area: Optimal data for standard evaluation frameworks in the context of a changing model landscape
Paper Type: Research paper: up to 8 pages
Participation Mode: In-person
Confirmation: I have read and agree with the workshop's policy on behalf of myself and my co-authors.
Submission Number: 61