Large-Scale Pretraining Offers Modest Benefits for Tabular Transfer Learning

ICLR 2026 Conference Submission 21453 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: tabular foundation models, tabular transfer learning, large-scale pretraining
Abstract: Several recent works seek to train foundation models for tabular prediction by pretraining neural networks on large collections of tabular classification and regression datasets. These tabular foundation models (TFMs) are often reported to outperform non-pretrained baselines when applied to predictive tasks on unseen tables, demonstrating effective tabular transfer learning. In this paper, we show that, in contrast to the positive conclusions of prior works, the perceived performance benefits from large-scale tabular pretraining largely diminish when we aggregate the results across datasets while (i) preserving the performance differences between models in their original scale (e.g., without min-max normalization); and (ii) testing for the statistical significance of these differences. For example, when we replicate the original evaluation setup for TabPFN-v2 on classification tasks, TabPFN-v2 indeed achieves the highest average min-max normalized AUROC, but reaches a statistical tie with CatBoost on 69% of all datasets, while significantly outperforming it on 20.7% of datasets and underperforming it on the remaining 10.3%. We evaluate seven open-source TFMs on 88 classification and 82 regression datasets in both full-data (i.e., using all training examples) and few-shot settings, and find that existing TFMs show statistically significant improvements over non-pretrained baselines only on small classification datasets, with no consistent gains in other settings. To isolate the impact of tabular pretraining, we also compare three TFMs directly to their non-pretrained counterparts, and find that, in most cases, the performance gains from pretraining are minimal. Our findings suggest that, unlike in vision and language, simply scaling pretraining over a diverse collection of tabular datasets may offer limited performance benefits. To support reproducible research and enable standardized evaluation of TFMs, we release our evaluation suite as the TFM Evaluation Harness.
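For readers who want a concrete picture of the per-dataset win/tie/loss aggregation described in the abstract, the sketch below is a minimal, hypothetical implementation in Python (it is not the authors' released TFM Evaluation Harness). It assumes paired per-run AUROC scores for two models on each dataset (e.g., over random seeds or cross-validation folds) and uses a Wilcoxon signed-rank test as one possible significance test; the dataset names, seed counts, and the choice of test are illustrative assumptions.

```python
# Sketch of a win/tie/loss aggregation over raw (un-normalized) AUROC
# differences, with a per-dataset significance test. Illustrative only.
import numpy as np
from scipy.stats import wilcoxon


def compare_on_dataset(scores_a, scores_b, alpha=0.05):
    """Classify one dataset as 'win', 'tie', or 'loss' for model A vs. model B.

    scores_a, scores_b: paired per-run AUROC arrays for the same dataset
    (e.g., over random seeds or cross-validation folds).
    """
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    diffs = scores_a - scores_b
    if np.allclose(diffs, 0.0):
        return "tie"  # identical scores; the signed-rank test is undefined here
    # Paired, non-parametric test on the raw score differences.
    _, p_value = wilcoxon(scores_a, scores_b)
    if p_value >= alpha:
        return "tie"
    return "win" if diffs.mean() > 0 else "loss"


def aggregate(results, alpha=0.05):
    """results: dict mapping dataset name -> (scores_a, scores_b).

    Returns the percentage of datasets in each outcome category.
    """
    outcomes = [compare_on_dataset(a, b, alpha) for a, b in results.values()]
    n = len(outcomes)
    return {label: 100.0 * outcomes.count(label) / n
            for label in ("win", "tie", "loss")}


# Usage with synthetic per-seed AUROC values for two hypothetical datasets.
rng = np.random.default_rng(0)
example = {
    "dataset_1": (0.90 + 0.01 * rng.standard_normal(10),
                  0.88 + 0.01 * rng.standard_normal(10)),
    "dataset_2": (0.75 + 0.02 * rng.standard_normal(10),
                  0.75 + 0.02 * rng.standard_normal(10)),
}
print(aggregate(example))  # e.g., {'win': 50.0, 'tie': 50.0, 'loss': 0.0}
```

The key design point this sketch illustrates is that differences are tested dataset by dataset in their original scale, rather than being min-max normalized and averaged into a single ranking.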
Primary Area: datasets and benchmarks
Submission Number: 21453