Keywords: Benchmark, Tabular Models, Text Embeddings, Feature Selection, Dimensionality Reduction
TL;DR: We introduce a new benchmark for text in tabular data, evaluate embedding techniques, highlight their limitations, and analyze performance.
Abstract: Foundation models for tabular data are rapidly evolving, with increasing interest in extending them to support additional modalities such as free-text features. However, existing benchmarks for tabular data rarely include textual columns, and identifying real-world tabular datasets with semantically rich text features is non-trivial.
We propose a series of simple yet effective ablation-style strategies for incorporating text into conventional tabular pipelines.
Moreover, we manually curate a collection of real-world tabular datasets with meaningful textual features and use it to benchmark how well state-of-the-art tabular foundation models handle textual data.
Our study is a step towards better benchmarking of foundation models for tabular data with text.
Submission Number: 103