Position: Human Baselines in Model Evaluations Need Rigor and Transparency (With Recommendations & Reporting Checklist)
TL;DR: We propose recommendations to improve human baselining methods in foundation model evaluations, and we systematically review 115 human baselines to identify where current evaluations fall short.
Abstract: **In this position paper, we argue that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance, and we provide recommendations and a reporting checklist towards this end.** Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework with recommendations for designing, executing, and reporting human baselines. We synthesize our recommendations into a checklist that we use to systematically review 115 human baselines (studies) in foundation model evaluations and thus identify shortcomings in existing baselining methods; our checklist can also assist researchers in conducting human baselines and reporting results. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: [https://github.com/kevinlwei/human-baselines](https://github.com/kevinlwei/human-baselines).
Lay Summary: Advanced AI systems are increasingly able to perform complex, realistic, and profitable tasks. How can we meaningfully figure out whether AI systems can perform these tasks as well as humans, and how much better or worse they are? We looked at how other disciplines like psychology, economics, and political science measure differences between groups of humans, and based on their practices, we wrote guidelines for comparing AI and human performance. We then looked at AI studies that made human vs. AI performance comparisons, and we found that most comparisons aren't very trustworthy. For instance, many studies don't compare AI systems with enough humans, or they actually compare humans and AI systems on different tasks under the hood.
This research will help improve our understanding of what AI systems can do compared to what humans can do. That understanding is important not just to AI researchers, but also to companies and users who want to know where AI excels and where it falls short, as well as to policymakers thinking about AI risks and about how AI can affect jobs. We hope that our research can lead to better AI research, AI use, and AI policy.
Verify Author Names: My co-authors have confirmed that their names are spelled correctly both on OpenReview and in the camera-ready PDF. (If needed, please update ‘Preferred Name’ in OpenReview to match the PDF.)
No Additional Revisions: I understand that after the May 29 deadline, the camera-ready submission cannot be revised before the conference. I have verified with all authors that they approve of this version.
Pdf Appendices: My camera-ready PDF file contains both the main text (not exceeding the page limits) and all appendices that I wish to include. I understand that any other supplementary material (e.g., separate files previously uploaded to OpenReview) will not be visible in the PMLR proceedings.
Latest Style File: I have compiled the camera ready paper with the latest ICML2025 style files <https://media.icml.cc/Conferences/ICML2025/Styles/icml2025.zip> and the compiled PDF includes an unnumbered Impact Statement section.
Paper Verification Code: Y2MwM
Link To Code: https://github.com/kevinlwei/human-baselines
Permissions Form: pdf
Primary Area: Research Priorities, Methodology, and Evaluation
Keywords: human baseline, human performance, human performance baseline, science of evaluations, AI evaluation, model evaluation, LLM evaluation, evaluation methodology, language model, foundation model
Submission Number: 59