Measuring the Ruler: Reading Benchmark Saturation as Evidence

Published: 04 Jun 2026, Last Modified: 12 Jun 2026
Venue: PhilML@ICML 2026 Poster
License: CC BY 4.0
Keywords: benchmarking, evaluation, construct validity, language models, measurement, benchmark saturation, robustness
Abstract: We consider the inference from a benchmark score to a claim about a system's capabilities, and argue that this inference is conditional on the system being evaluated. A benchmark supports a claim about a target capability, such as reasoning or programming competence, only when the benchmark-system pair has been validated. The same benchmark can therefore support different inferences for systems with different optimisation histories. Under this view, benchmark saturation is not just a sign that harder tests are needed: it is evidence about the validity relation itself. We illustrate this on MMLU, GSM8K, and HumanEval, and propose a short Validity Transfer Report which benchmark papers can use to make the relevant assumptions explicit.
Submission Number: 62