MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

Published: 25 May 2026, Last Modified: 29 May 2026, FMSD @ ICML 2026 Poster, CC BY 4.0
Keywords: multimodal tabular learning; tabular data; tabular foundation model; deep learning; benchmark; dataset curation; image-tabular; text-tabular
TL;DR: We introduce MulTaBench, a benchmark identifying the fundamental need for task-specific adaptation over frozen representations in Multimodal Tabular Learning.
Abstract: Tabular Foundation Models have recently established the state of the art in supervised tabular learning. However, they lack native support for unstructured modalities such as text and images, and rely on frozen, pretrained embeddings to process them. We show that tuning the embeddings to the task improves performance on established Multimodal Tabular benchmarks. We introduce MulTaBench, a benchmark of 40 datasets, split equally between image-tabular and text-tabular tasks. We focus on predictive tasks where the modalities provide complementary predictive signal and where generic embeddings lose critical information, necessitating Target-Aware Representations that are aligned with the task. We demonstrate that the gains from target-aware representation tuning generalize across both text and image modalities, several tabular learners, encoder scales, and embedding dimensions. MulTaBench constitutes the largest image-tabular benchmarking effort to date, enabling research on novel architectures that incorporate target-aware representations and paving the way for the development of Multimodal Tabular Foundation Models.
Submission Number: 135