What You Pretrain On Matters: Synthetic Task Distributions Determine Tabular Foundation Model Quality

Published: 25 May 2026, Last Modified: 29 May 2026
Venue: FMSD @ ICML 2026 Poster
License: CC BY 4.0
Keywords: Tabular Foundation Models, Synthetic Prior Design, In-Context Learning, Structural Causal Models
Abstract: Tabular foundation models acquire their inductive biases almost entirely from synthetic pretraining distributions, yet prior design remains poorly understood. Standard synthetic priors are too well-behaved: they omit the confounding, structured missingness, distributional shift, and spurious support-query correlations that characterize real tabular data. We introduce O'Prior, a compositional realism prior built around four coupled components: a hierarchical SCM meta-generator spanning diverse functional families; a modular realism engine covering heterogeneous marginals, missingness, and target transforms; an explicit stress module injecting confounding and support-query mismatch; and a curriculum-governed, leakage-safe generation protocol. Holding architecture, optimizer, and compute budget fixed, O'Prior yields consistent and substantial improvements across real tabular benchmarks. Ablations confirm that mechanism diversity, realism composition, and shift-aware stress each contribute independently, establishing synthetic prior construction as a first-order and largely overlooked determinant of tabular foundation model quality.
Submission Number: 184
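
The abstract names several ingredients of a realism prior: an SCM-based generator with varied functional families, hidden confounding, structured missingness, and support-query mismatch. The following is a minimal illustrative sketch of how one synthetic task with those properties could be sampled. It is not O'Prior itself: the function `sample_synthetic_task`, its parameters, the choice of nonlinearities, and the specific missingness and shift mechanisms are all assumptions made for illustration.

```python
import numpy as np

def sample_synthetic_task(n_rows=200, n_nodes=6, missing_rate=0.1, shift=1.0, seed=0):
    """Hypothetical sketch: draw one synthetic tabular task from a random SCM."""
    rng = np.random.default_rng(seed)

    # Random DAG: strictly upper-triangular weights, so node j depends only on nodes i < j.
    edge_mask = np.triu(rng.random((n_nodes, n_nodes)) < 0.5, k=1)
    weights = rng.normal(size=(n_nodes, n_nodes)) * edge_mask

    # One mechanism per node, drawn from a small (assumed) family of functions.
    families = [np.tanh, np.sin, lambda z: z, lambda z: np.maximum(z, 0.0)]
    acts = [families[i] for i in rng.integers(len(families), size=n_nodes)]

    # Hidden confounder: influences every node but is never exposed as a feature.
    confounder = rng.normal(size=n_rows)
    conf_w = rng.normal(size=n_nodes)

    # Ancestral sampling in topological order (columns not yet filled are zero).
    values = np.zeros((n_rows, n_nodes))
    for j in range(n_nodes):
        parents = values @ weights[:, j]
        noise = rng.normal(scale=0.3, size=n_rows)
        values[:, j] = acts[j](parents + conf_w[j] * confounder + noise)

    # Last node is the target; the rest are observed features.
    X, y = values[:, :-1].copy(), values[:, -1]

    # Support-query mismatch: rows with larger feature 0 tend to land in the query set.
    # shift=0 gives a purely random split.
    order = np.argsort(shift * X[:, 0] + rng.normal(size=n_rows))
    n_support = int(0.7 * n_rows)
    support_idx, query_idx = order[:n_support], order[n_support:]

    # Value-dependent missingness: larger values are more likely to be masked.
    p_missing = 2.0 * missing_rate / (1.0 + np.exp(-X))
    X[rng.random(X.shape) < p_missing] = np.nan

    return {
        "X_support": X[support_idx], "y_support": y[support_idx],
        "X_query": X[query_idx], "y_query": y[query_idx],
    }

task = sample_synthetic_task(shift=5.0)
print(task["X_support"].shape, task["X_query"].shape)
```

In a pretraining loop, many such tasks would be drawn with different seeds and settings, and the model would be trained to predict `y_query` from the support set in context. Because the split is fixed before any masking and the target is never masked, the query labels stay separate from the support set, which is one simple reading of a leakage-safe protocol.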