Keywords: tabular foundation models, ensembling, post-hoc ensembling, benchmarking, calibration, diversity
TL;DR: Six modern TFMs share an ICL-on-synthetic-priors recipe (Q=0.961), so ensembling all six buys only +0.18% accuracy at 253× the compute and introduces a calibration trap.
Abstract: Tabular foundation models (TFMs) now match or beat tuned gradient-boosted trees on a growing fraction of tabular tasks, but **no single TFM wins on every dataset**. Ensembling is the textbook fix, but it works less well than expected. Six modern TFMs form a near-redundant pool: their **mean pairwise Q-statistic is 0.961**, close enough to 1 that the gain available to any convex combination of their predictions is tightly bounded. We benchmark **six ensemble strategies over six TFMs on 153 OpenML classification tasks**. The best ensemble, two-level cascade stacking, buys **+0.18% accuracy over the strongest single TFM** at 253× the compute. A Friedman–Nemenyi analysis places three ensembles and the best base TFM in a single equivalence group; the other three ensembles are significantly worse than the best base TFM. Stacking with a logistic-regression meta-learner is the most striking case: it achieves competitive accuracy and ROC-AUC, yet has the worst log-loss rank among the ensembles. The meta-learner improves accuracy by sharpening class boundaries, which destroys calibration. We recommend **greedy selection as the practical default**.
Submission Number: 170
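The redundancy claim in the abstract rests on the pairwise Q-statistic. As a rough illustration only, the sketch below computes Yule's Q from per-sample correctness indicators of two classifiers and averages it over all pairs in a pool. The function names `q_statistic` and `mean_pairwise_q` are hypothetical, and the paper's exact aggregation (for example, how values are averaged across the 153 datasets) is not stated here and may differ.

```python
from itertools import combinations

import numpy as np


def q_statistic(correct_a, correct_b):
    """Yule's Q for two classifiers, given boolean 'prediction was correct' vectors.

    Q = (N11*N00 - N01*N10) / (N11*N00 + N01*N10)
    Q near 1 means the two models tend to be right and wrong on the same samples.
    """
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    n11 = int(np.sum(a & b))    # both correct
    n00 = int(np.sum(~a & ~b))  # both wrong
    n10 = int(np.sum(a & ~b))   # only the first correct
    n01 = int(np.sum(~a & b))   # only the second correct
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        return float("nan")     # undefined, e.g. when neither model ever errs
    return (n11 * n00 - n01 * n10) / denom


def mean_pairwise_q(correct_matrix):
    """Average Q over all model pairs; rows are models, columns are samples."""
    m = np.asarray(correct_matrix, dtype=bool)
    qs = [q_statistic(m[i], m[j]) for i, j in combinations(range(len(m)), 2)]
    return float(np.nanmean(qs))
```

Under this definition, two models that err on exactly the same samples give Q = 1, and models whose errors are statistically independent give Q = 0, so a pool mean of 0.961 indicates that the base models mostly fail together.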