Keywords: Benchmark, Tabular Models, Text Embeddings, Feature Selection, Dimensionality Reduction
TL;DR: We curate a high-quality pool of datasets for learning text representations in tabular data. We benchmark off-the-shelf text-embedding and downsampling strategies to identify which perform best on tabular prediction tasks.
Abstract: Foundation models for tabular data are rapidly evolving, with increasing interest in extending them to support additional modalities such as free-text features. However, existing benchmarks for tabular data rarely include textual columns, and identifying real-world tabular datasets with semantically rich text features is non-trivial.
We propose a series of simple yet effective ablation-style strategies for incorporating text into conventional tabular pipelines.
Moreover, we benchmark how well state-of-the-art tabular foundation models handle textual data, using a manually curated collection of real-world tabular datasets with meaningful textual features.
Our study is an important step towards improving the benchmarking of foundation models for tabular data with text.
Submission Number: 103