Are Tabular Foundation Model Rankings Reliable? A Generalizability Theory Analysis of RelBench and DBInfer

Published: 25 May 2026, Last Modified: 29 May 2026, Venue: FMSD @ ICML 2026 Poster, License: CC BY 4.0
Keywords: Benchmarking, Relational foundation model, RelBench, G-theory, DBInfer, Tabular foundation model, Ranking reliability
Abstract: Tabular and relational learning foundation models are often evaluated on multi-task relational benchmarks such as RelBench and DBInfer, where models are ranked by aggregated task-level performance. But how reliable are these rankings? We apply Generalizability Theory (G-theory), a variance decomposition framework from measurement science, to quantify ranking reliability across 14 datasets (9 RelBench + 5 DBInfer), 48 tasks, and up to 35 models from 11 families. We decompose score variance into model differences (signal), task effects, model×task interaction, and sampling error. Our analysis yields three main findings: (1) only 2 of 14 datasets achieve Eρ² > 0.80, the psychometric threshold for reliable measurement; (2) decision-study (D-study) simulations show that 80–99% of test items can be removed while maintaining ranking stability (ρ > 0.90), revealing massive redundancy; (3) model×task interaction, which indicates task-dependent model behavior, is substantial across several datasets and dominates in some cases, reaching 81.2% of variance on rel-stack. Collectively, these findings suggest that single-number leaderboard rankings of tabular and relational learning foundation models may provide an unstable estimate of relative model performance.
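The variance decomposition and the Eρ² coefficient described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes a fully crossed models × tasks score matrix with one score per cell, so the model×task interaction and sampling error are confounded in a single residual term (separating them, as the paper does, would need replicates such as seeds or test items). The function names `g_study` and `g_coefficient` are hypothetical.

```python
import numpy as np

def g_study(scores):
    """Estimate variance components for a crossed models x tasks design.

    Assumption: one score per (model, task) cell, so the interaction and
    sampling error cannot be separated and are reported as "residual".
    """
    scores = np.asarray(scores, dtype=float)
    n_models, n_tasks = scores.shape
    grand = scores.mean()
    model_means = scores.mean(axis=1)
    task_means = scores.mean(axis=0)

    # Sums of squares for the two main effects and the residual.
    ss_model = n_tasks * np.sum((model_means - grand) ** 2)
    ss_task = n_models * np.sum((task_means - grand) ** 2)
    ss_total = np.sum((scores - grand) ** 2)
    ss_res = ss_total - ss_model - ss_task

    # Mean squares.
    ms_model = ss_model / (n_models - 1)
    ms_task = ss_task / (n_tasks - 1)
    ms_res = ss_res / ((n_models - 1) * (n_tasks - 1))

    # Solve the expected-mean-square equations; clip negatives to zero.
    return {
        "model": max((ms_model - ms_res) / n_tasks, 0.0),
        "task": max((ms_task - ms_res) / n_models, 0.0),
        "residual": max(ms_res, 0.0),
    }

def g_coefficient(components, n_tasks):
    """Generalizability coefficient for relative (ranking) decisions.

    Computes var_model / (var_model + var_residual / n_tasks). Varying
    n_tasks gives a simple D-study projection.
    """
    relative_error = components["residual"] / n_tasks
    denom = components["model"] + relative_error
    return components["model"] / denom if denom > 0 else 0.0
```

Under these assumptions, calling `g_coefficient(g_study(scores), n_tasks=k)` for several values of `k` projects how the coefficient would change if the benchmark kept only `k` tasks, which can then be compared against the 0.80 threshold mentioned above.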
Submission Number: 128