Generalization Can Emerge in Tabular Foundation Models From a Single Table

Published: 18 Nov 2025 · Last Modified: 18 Nov 2025 · AITD@EurIPS 2025 Oral · CC BY 4.0
Submission Type: Short paper (4 pages)
Keywords: tabular data, in-context learning, pre-training, investigation, self-supervised learning, foundation model
TL;DR: Tabular ICL models trained on a single dataset yield surprisingly good results; we investigate and find that the number of features and the number of unique tasks are the most important determinants of downstream performance.
Abstract: Deep tabular modelling increasingly relies on in-context learning, where, during inference, a model receives a set of $(x,y)$ pairs as context and predicts labels for new inputs without weight updates. We challenge the prevailing view that broad generalization in this setting requires pre-training on large synthetic corpora (e.g., TabPFN priors) or a large collection of real data (e.g., TabDPT training datasets), finding instead that a relatively small amount of data suffices. Simple self-supervised pre-training on just a single real table can produce surprisingly strong transfer across heterogeneous benchmarks. By systematically pre-training and evaluating on many diverse datasets, we analyze which aspects of the data matter most for building a Tabular Foundation Model (TFM) that generalizes across domains. We then connect this to the pre-training procedure shared by most TFMs and show that the number and quality of \emph{tasks} one can construct from a dataset is key to downstream performance.
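The abstract describes two mechanisms: ICL inference from a context of $(x,y)$ pairs without weight updates, and self-supervised pre-training that constructs many tasks from a single table. Below is a minimal sketch of one plausible way to sample such tasks, assuming a purely numeric table; the names `sample_icl_task` and `knn_icl_predict` are hypothetical, and the k-NN predictor is only a stand-in for the paper's transformer-based TFM, used here to illustrate predicting query labels from context alone.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_icl_task(table, n_context=32, n_query=8, n_features=4, rng=rng):
    """Construct one in-context task from a single table by sampling rows,
    sampling a feature subset, and treating one held-out column as the
    pseudo-label (hypothetical task-sampling scheme, not the paper's exact one)."""
    n_rows, n_cols = table.shape
    cols = rng.choice(n_cols, size=n_features + 1, replace=False)
    target_col, feature_cols = cols[0], cols[1:]
    rows = rng.choice(n_rows, size=n_context + n_query, replace=False)
    X = table[rows][:, feature_cols]
    y = table[rows][:, target_col]
    return (X[:n_context], y[:n_context]), (X[n_context:], y[n_context:])

def knn_icl_predict(context_X, context_y, query_X, k=5):
    """Stand-in for an ICL model: predict each query label from its nearest
    context rows, with no weight updates."""
    preds = []
    for q in query_X:
        dist = np.linalg.norm(context_X - q, axis=1)
        nearest = np.argsort(dist)[:k]
        preds.append(context_y[nearest].mean())
    return np.array(preds)

# Toy "single table": 500 rows, 10 numeric columns.
table = rng.normal(size=(500, 10))
(ctx_X, ctx_y), (qry_X, qry_y) = sample_icl_task(table)
print(knn_icl_predict(ctx_X, ctx_y, qry_X))
```

In this framing, the number of distinct (feature subset, target column) combinations one can draw is what the abstract refers to as the number of unique tasks available from a single dataset.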
Submission Number: 11