Are Tabular Foundation Model Rankings Reliable? A Generalizability Theory Analysis of RelBench and DBInfer
Keywords: Benchmarking, Relational foundation model, RelBench, G-theory, DBInfer, Tabular foundation model, Ranking reliability
Abstract: Tabular and relational learning foundation models are often evaluated on multi-task relational benchmarks such as RelBench and DBInfer, where models are ranked by aggregated task-level performance. However, how reliable are these rankings? We apply Generalizability Theory (G-theory), a variance decomposition framework from measurement science, to quantify ranking reliability across 14 datasets (9 RelBench + 5 DBInfer), 48 tasks, and up to 35 models from 11 families. We decompose score variance into model differences (signal), task effects, model×task interaction, and sampling error. Our analysis yields three main findings: (1) only 2 of 14 datasets achieve Eρ² > 0.80, the psychometric threshold for reliable measurement; (2) decision-study (D-study) simulations show that 80–99% of test items can be removed while maintaining ranking stability (ρ > 0.90), revealing massive redundancy; (3) model×task interaction, which indicates task-dependent model behavior, is substantial across several datasets and dominates in some cases, reaching 81.2% of variance on rel-stack. Collectively, these findings suggest that single-number leaderboard rankings of tabular and relational learning foundation models may provide an unstable estimate of relative model performance.
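To make the variance decomposition concrete, the following is a minimal sketch of a G-study and D-study for a fully crossed model×task design. It is not the authors' implementation. It assumes one score per (model, task) cell, so the model×task interaction and sampling error are confounded in a single residual term, whereas the analysis described above separates them. The function names `g_study` and `g_coefficient` and the synthetic scores are hypothetical.

```python
import numpy as np

def g_study(scores):
    """Estimate variance components for a crossed model x task design.

    scores: array of shape (n_models, n_tasks), one score per cell
    (assumption: no replicates, so interaction and error are confounded).
    """
    scores = np.asarray(scores, dtype=float)
    n_m, n_t = scores.shape
    grand = scores.mean()
    model_means = scores.mean(axis=1)
    task_means = scores.mean(axis=0)

    # Sums of squares for the two main effects and the residual.
    ss_m = n_t * np.sum((model_means - grand) ** 2)
    ss_t = n_m * np.sum((task_means - grand) ** 2)
    ss_res = np.sum((scores - grand) ** 2) - ss_m - ss_t

    # Mean squares.
    ms_m = ss_m / (n_m - 1)
    ms_t = ss_t / (n_t - 1)
    ms_res = ss_res / ((n_m - 1) * (n_t - 1))

    # Solve the expected-mean-square equations; clamp negatives to zero.
    var_res = max(ms_res, 0.0)
    var_m = max((ms_m - ms_res) / n_t, 0.0)
    var_t = max((ms_t - ms_res) / n_m, 0.0)
    return {"model": var_m, "task": var_t, "residual": var_res}

def g_coefficient(components, n_tasks):
    """Relative G coefficient (E rho^2) for a benchmark with n_tasks tasks."""
    relative_error = components["residual"] / n_tasks
    denom = components["model"] + relative_error
    return components["model"] / denom if denom > 0 else 0.0

# Synthetic example (hypothetical data): 6 models, 8 tasks.
rng = np.random.default_rng(0)
scores = (
    rng.normal(0.0, 1.0, size=(6, 1))    # model effect (signal)
    + rng.normal(0.0, 1.0, size=(1, 8))  # task effect
    + rng.normal(0.0, 0.5, size=(6, 8))  # interaction + error
)
components = g_study(scores)
coef = g_coefficient(components, n_tasks=scores.shape[1])

# D-study: projected reliability if the benchmark had fewer or more tasks.
d_study = {n: g_coefficient(components, n) for n in (1, 2, 4, 8, 16)}
```

In this sketch the task main effect does not enter the relative coefficient because it shifts every model equally and so leaves the ranking unchanged. Only the residual, divided by the number of tasks, counts as error.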
Submission Number: 128