Position: Benchmark Method-Comparisons Are Posterior Identifiability Problems

Published: 30 May 2026, Last Modified: 01 Jun 2026 | Venue: SPIGM @ ICML Poster | Readers: Everyone | License: CC BY 4.0
Keywords: posterior identifiability, benchmark evaluation, Bayesian inference, foundation models, structured probabilistic inference, position paper
TL;DR: Benchmark method-comparisons are posterior identifiability problems. We prove and demonstrate a rank-based phase transition (at W=3 workloads) plus condition-number sensitivity at fixed rank, and propose a four-item disclosure standard.
Abstract: The dominant evaluation paradigm in foundation-model and machine-learning systems research compares methods by their headline scores on a handful of benchmarks. We argue that this paradigm is, structurally, a posterior identifiability problem: the observed score gap is a function of multiple latent contributions (algorithmic effect, workload-shape interaction, hardware-implementation interaction), and the posterior over the algorithmic component is identifiable only when the experimental design provides enough linearly independent observations. We formalize the problem as a linear-Gaussian inverse problem with rank-based identifiability and prove a posterior identifiability proposition with a closed-form expression for the posterior variance. Two empirical experiments demonstrate the implications: a phase-transition experiment showing that the posterior standard deviation of the algorithmic effect contracts from 0.75 (essentially the prior) at one workload to 0.012 at five workloads, with the transition occurring at three workloads (the rank of the parameter space); and a condition-number sweep showing that posterior contraction degrades smoothly from 95% (well-conditioned) to 27% (near-singular) at fixed rank. We propose that benchmark-comparison papers report four posterior identifiability diagnostics: a latent-parameter inventory, the design-matrix rank, posterior contraction relative to the prior, and posterior credible intervals on the headline gap.
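The rank-based transition described in the abstract can be illustrated with a minimal sketch of a linear-Gaussian model. Everything specific here is an assumption made for illustration and is not taken from the paper: the three-parameter layout (algorithmic effect, workload-shape interaction, hardware interaction), the hypothetical design matrix `DESIGN`, the isotropic prior, and the values of `prior_std` and `noise_std`. The numbers it produces are not the paper's reported results. It uses the standard conjugate formula for the posterior covariance, (XᵀX/σ² + I/τ²)⁻¹.

```python
import numpy as np


def posterior_std_algorithmic(X, prior_std=1.0, noise_std=0.01):
    """Posterior std of theta[0] for y = X @ theta + noise.

    Assumes theta ~ N(0, prior_std^2 I) and noise ~ N(0, noise_std^2 I),
    so the posterior covariance is (X^T X / sigma^2 + I / tau^2)^{-1}.
    """
    d = X.shape[1]
    precision = X.T @ X / noise_std**2 + np.eye(d) / prior_std**2
    cov = np.linalg.inv(precision)
    return float(np.sqrt(cov[0, 0]))


# Hypothetical design: one row per workload. Columns are
# [algorithmic effect, workload-shape feature, hardware feature].
# The first column is 1 because the algorithmic effect enters every gap.
DESIGN = np.array(
    [
        [1.0, 1.0, 0.0],
        [1.0, 0.0, 1.0],
        [1.0, 1.0, 1.0],
        [1.0, 2.0, 1.0],
        [1.0, 1.0, 2.0],
    ]
)


def sweep(design=DESIGN, prior_std=1.0, noise_std=0.01):
    """Return (workloads, rank, posterior std, contraction) for W = 1..len(design)."""
    rows = []
    for w in range(1, len(design) + 1):
        X = design[:w]
        rank = int(np.linalg.matrix_rank(X))
        std = posterior_std_algorithmic(X, prior_std, noise_std)
        contraction = 1.0 - std / prior_std  # fraction of prior std removed
        rows.append((w, rank, std, contraction))
    return rows


if __name__ == "__main__":
    for w, rank, std, contraction in sweep():
        print(f"W={w} rank={rank} post_std={std:.4f} contraction={contraction:.1%}")
```

In this toy design the posterior standard deviation on the first parameter stays at a sizeable fraction of the prior while the design-matrix rank is below three, then drops sharply once the third linearly independent workload is added. Replacing rows of `DESIGN` with nearly collinear ones is one way to explore the condition-number sensitivity at fixed rank.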
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 257