Graduating the Benchmark Scale: Lessons from Thermometry

Published: 29 Apr 2026, Last Modified: 11 May 2026 | Eval Eval @ ACL 2026 Poster | Everyone | CC BY 4.0
Keywords: benchmarks, thermometry, philosophy of science, problem of nomic measurement
TL;DR: The functional relationship between benchmark scores and the underlying capability they claim to measure is unknown; drawing on the history of thermometry, we argue that the severity of this problem depends on one's research goals.
Abstract: Benchmarks for assessing large language model (LLM) capabilities have been criticized for a lack of construct validity. Here, we focus on an often-overlooked dimension of a benchmark's validity: the functional mapping between a benchmark's numerical score and the underlying quantity the benchmark purports to measure. What licenses the assumption that equivalent intervals on a scale correspond to equivalent differences in the underlying capability? We argue that this question is not merely theoretical: the form of this mapping (e.g., linear vs. logarithmic vs. exponential) could and should influence decisions about deployment and regulatory policy. Drawing on work from the history and philosophy of science, we discuss an analogous problem in the early history of thermometry, termed the problem of nomic measurement, as well as the epistemic practices that enabled scientists to overcome it. We then ask whether a similar process of epistemic iteration can overcome this problem in benchmarking. Despite clear differences between temperature and capabilities as constructs, we argue that some modest success could be achievable in the domain of benchmarking, but that this depends crucially on the clear articulation of a researcher's goals and theoretical commitments.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Provocation
Archival Status: Archival
Submission Number: 48