Benchmarking Multimodal AutoML for Tabular Data with Text Fields

Xingjian Shi; Jonas Mueller; Nick Erickson; Mu Li; Alex Smola

Published: 11 Oct 2021, Last Modified: 04 May 2025

Venue: NeurIPS 2021 Datasets and Benchmarks Track (Round 2)

Readers: Everyone

Keywords: Multimodal AutoML, Text Data, Tabular Data, Natural Language Processing, Supervised Learning

TL;DR: We present a new benchmark for classification and regression on data tables that jointly contain numeric, categorical, and text features, together with a systematic evaluation of various text/tabular modeling strategies on this benchmark.

Abstract: We consider the use of automated supervised learning systems for data tables that contain not only numeric/categorical columns but also one or more text fields. Here we assemble 18 multimodal data tables that each contain some text fields and stem from a real business application. Our publicly available benchmark enables researchers to comprehensively evaluate their own methods for supervised learning with numeric, categorical, and text features. To ensure that any single modeling strategy which performs well over all 18 datasets will serve as a practical foundation for multimodal text/tabular AutoML, the diverse datasets in our benchmark vary greatly in sample size, problem type (a mix of classification and regression tasks), number of features (with the number of text columns ranging from 1 to 28 between datasets), and how the predictive signal is decomposed between text and numeric/categorical features (and predictive interactions thereof). Over this benchmark, we evaluate various straightforward pipelines to model such data, including standard two-stage approaches where NLP is used to featurize the text such that AutoML for tabular data can then be applied. When fit to the raw text/tabular data, the fully automated methodology that performed best on our benchmark also ranks 1st against human data science teams in two MachineHack prediction competitions and 2nd (out of 2380 teams) in Kaggle's Mercari Price Suggestion Challenge.
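
The two-stage approach mentioned in the abstract can be illustrated with a minimal sketch. It is not the benchmark's code: hashed token counts stand in for the NLP featurizer, a 1-nearest-neighbor rule stands in for the tabular AutoML system, and the column names (`price`, `description`), the helper names, and the toy rows are all hypothetical.

```python
import hashlib


def featurize_text(text, n_buckets=8):
    """Stage 1 (assumed featurizer): hashed bag-of-words counts for one text field."""
    vec = [0.0] * n_buckets
    for token in text.lower().split():
        # md5 gives a hash that is stable across runs, unlike built-in hash().
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % n_buckets] += 1.0
    return vec


def to_tabular_row(row, text_cols, numeric_cols, n_buckets=8):
    """Keep numeric columns as-is and replace each text column by its hashed features."""
    features = [float(row[c]) for c in numeric_cols]
    for c in text_cols:
        features.extend(featurize_text(row[c], n_buckets))
    return features


def predict_1nn(train_X, train_y, x):
    """Stage 2 (stand-in for a tabular learner): label of the closest training row."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    best = min(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))
    return train_y[best]


# Hypothetical toy table with one numeric column and one text column.
rows = [
    {"price": 10.0, "description": "cheap plastic phone case"},
    {"price": 900.0, "description": "flagship phone with great camera"},
    {"price": 15.0, "description": "basic phone charger cable"},
]
y = ["accessory", "device", "accessory"]

# After stage 1 the table is purely numeric, so any tabular model can be applied.
X = [to_tabular_row(r, ["description"], ["price"]) for r in rows]
query = to_tabular_row(
    {"price": 850.0, "description": "new phone with camera"}, ["description"], ["price"]
)
prediction = predict_1nn(X, y, query)
```

In practice the first stage would typically be a stronger text representation and the second stage a full tabular AutoML system; the point of the sketch is only that the text columns are converted to numeric features before the tabular learner sees the data.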

Supplementary Material: pdf

URL: https://github.com/sxjscience/automl_multimodal_benchmark

Contribution Process Agreement: Yes

Dataset Url: https://github.com/sxjscience/automl_multimodal_benchmark

Author Statement: Yes

Community Implementations: [2 code implementations](https://www.catalyzex.com/paper/benchmarking-multimodal-automl-for-tabular/code)
