What Does Your Benchmark Really Measure? A Framework for Robust Inference of AI Capabilities

18 Sept 2025 (modified: 12 Feb 2026) · ICLR 2026 Conference Desk Rejected Submission · CC BY 4.0
Keywords: Statistical Inference, AI Ability, Benchmarks, Robust Inference, Item Response Theory, Robustness
TL;DR: AI evaluations on benchmark data should be grounded in theories of ability and sound statistical inference.
Abstract: Evaluations of generative models on benchmark data are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI's capabilities. Yet skepticism about their reliability continues to grow. How can we know that a reported accuracy genuinely reflects a model's underlying performance? Although benchmark results are often presented as direct measurements, in practice they are inferences: treating a score as evidence of capability already presupposes a theory of what capability is and how it manifests in a testing environment. We formalize this observation by proposing a principled framework that treats evaluation as inference: first, articulate a theory of capability, and then derive estimators that target this quantity. This perspective is well established in fields such as psychometrics but remains underdeveloped in AI evaluation, where implicit assumptions often go unexamined. As a proof of concept, we apply our framework to a concrete challenge that undermines reliability: model sensitivity to perturbations. We introduce several capability models and show how various sources of uncertainty (e.g., from finite samples and perturbations) arise within these models as nuisance terms of the latent capability itself. We then use standard tools to derive methods that infer capability while accounting for these sources of uncertainty. Our results illustrate how a capability-centered perspective clarifies what evaluations measure and how to adjust for known sources of unreliability. More broadly, our framework yields evaluations that are transparent, grounded in cognitive theory, and better aligned with the scientific claims they aim to support.
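The abstract's idea that finite-sample noise and perturbation sensitivity enter as distinct nuisance terms around a latent capability can be sketched in a toy variance decomposition. This is an illustrative sketch only, not the paper's estimator: the function name, the 0/1 response matrix, and the simple grand-mean capability model are all assumptions made for the example.

```python
import math

def estimate_capability(correct_matrix):
    """Toy capability estimate from a perturbations x items 0/1 matrix.

    correct_matrix[p][i] = 1 if the model answered item i correctly
    under perturbation p of the benchmark. Capability is modeled (for
    illustration) as the grand mean accuracy; perturbation effects and
    finite-sample noise are treated as nuisance variance components.
    Returns (estimate, standard_error).
    """
    num_perts = len(correct_matrix)
    num_items = len(correct_matrix[0])
    per_pert = [sum(row) / num_items for row in correct_matrix]
    grand = sum(per_pert) / num_perts
    # Between-perturbation variance: sensitivity to rephrasings etc.
    between = sum((a - grand) ** 2 for a in per_pert) / max(num_perts - 1, 1)
    # Within-perturbation (binomial sampling) variance from finite items.
    within = sum(a * (1 - a) / num_items for a in per_pert) / num_perts
    # Both nuisance components widen the uncertainty on the estimate.
    se = math.sqrt(between / num_perts + within / num_perts)
    return grand, se

# Example: 3 perturbations of a 4-item benchmark.
responses = [[1, 1, 0, 1],
             [1, 0, 0, 1],
             [1, 1, 1, 1]]
cap, se = estimate_capability(responses)
print(f"capability ≈ {cap:.2f} ± {se:.2f}")  # → capability ≈ 0.75 ± 0.18
```

The point of the sketch is that a single reported accuracy (any one row) would understate uncertainty: the standard error here grows with disagreement across perturbations, which is exactly the nuisance the framework asks evaluators to model explicitly.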
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 13593