Model Evaluations Need Rigorous and Transparent Human Baselines

Published: 05 Mar 2025, Last Modified: 15 Apr 2025
Venue: BuildingTrust
License: CC BY 4.0
Track: Long Paper Track (up to 9 pages)
Keywords: human baseline, human performance, human performance baseline, science of evaluations, AI evaluation, model evaluation, LLM evaluation, evaluation methodology, language model, foundation model
TL;DR: We propose an instructional framework to improve human baselining methods in foundation model evaluations, and we systematically review 113 human baselines to identify where current evaluations fall short.
Abstract: **This position paper argues that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance.** Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework for assessing human baselining methods. We then use our framework to systematically review 113 human baselines (studies) in foundation model evaluations, identifying shortcomings in existing baselining methods. We publish our framework as a reporting checklist for researchers conducting human baseline studies. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers.
Submission Number: 46
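
The abstract argues that "super-human" performance claims require robustly measuring the difference between model and human performance. As an illustrative sketch only (not the paper's method, and with hypothetical placeholder data), the snippet below compares model and human accuracy on the same items using a paired bootstrap confidence interval, which is one simple way to report such a difference with uncertainty.

```python
# Illustrative sketch: paired bootstrap comparison of model vs. human accuracy
# on the same evaluation items. All data are hypothetical placeholders.
import random

random.seed(0)

# Hypothetical per-item correctness (1 = correct, 0 = incorrect) for 200 items,
# scored for the model and for a human baseline on the same items.
n_items = 200
model_correct = [1 if random.random() < 0.82 else 0 for _ in range(n_items)]
human_correct = [1 if random.random() < 0.78 else 0 for _ in range(n_items)]


def paired_bootstrap_ci(a, b, n_boot=10_000, alpha=0.05):
    """Bootstrap CI for mean(a) - mean(b), resampling items with replacement."""
    n = len(a)
    diffs = []
    for _ in range(n_boot):
        idx = [random.randrange(n) for _ in range(n)]
        diffs.append(sum(a[i] for i in idx) / n - sum(b[i] for i in idx) / n)
    diffs.sort()
    lo_idx = int(alpha / 2 * n_boot)
    hi_idx = n_boot - 1 - lo_idx
    return diffs[lo_idx], diffs[hi_idx]


point = sum(model_correct) / n_items - sum(human_correct) / n_items
lo, hi = paired_bootstrap_ci(model_correct, human_correct)
print(f"Model - human accuracy: {point:+.3f}, 95% CI [{lo:+.3f}, {hi:+.3f}]")
# A "super-human" claim is only supported if the entire interval lies above zero,
# and, per the paper's argument, only if the human baseline itself was collected
# and documented rigorously.
```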