Position: AI Evaluation Should Learn from How We Test Humans

Published: 01 May 2025, Last Modified: 18 Jun 2025
Venue: ICML 2025 Position Paper Track poster
License: CC BY 4.0
TL;DR: This position paper argues that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations
Abstract: As AI systems continue to evolve, their rigorous evaluation becomes crucial for their development and deployment. Researchers have constructed various large-scale benchmarks to measure model capabilities, typically evaluating models against a gold-standard test set and reporting metrics averaged across all items. However, this static evaluation paradigm increasingly shows its limitations, including high evaluation costs, data contamination, and the impact of low-quality or erroneous items on evaluation reliability and efficiency. In this position paper, drawing from human psychometrics, we discuss a paradigm shift from static evaluation methods to adaptive testing. This involves estimating the characteristics or value of each test item in the benchmark and tailoring each model's evaluation accordingly, instead of relying on a fixed test set. This paradigm provides robust ability estimation, uncovering the latent traits underlying a model's observed scores. This position paper analyzes the current possibilities, prospects, and reasons for adopting psychometrics in AI evaluation. We argue that *psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations*.
Lay Summary: As AI systems, especially LLMs, continue to advance, rigorous and trustworthy evaluation becomes increasingly critical for their development and deployment. Today, AI is typically assessed using large collections of test questions called benchmarks, and each model is scored based on the average of all its answers. But this paradigm has growing problems: it is expensive and time-consuming, and benchmarks often include poor-quality or redundant questions. These issues can distort the results and make it harder to trust what the scores really mean. In this paper, we propose a new approach inspired by how human abilities are measured: adaptive testing from the field of psychometrics. Instead of treating every question equally, we estimate how useful or difficult each one is and dynamically adjust the test for each AI system. This leads to a fairer, faster, and more accurate way to measure an AI’s true capabilities. We argue that this shift in evaluation paradigm will not only save time and resources, but also give us deeper insights into what modern AI systems can and can't do.
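To make the adaptive testing idea concrete, below is a minimal, illustrative sketch rather than the paper's actual implementation. It assumes a two-parameter logistic (2PL) item response theory model with pre-calibrated item parameters: each benchmark item has a discrimination `a` and difficulty `b`, the next item is chosen to maximize Fisher information at the current ability estimate, and the latent ability `theta` is re-estimated after every response. The `answer_item` callback is a hypothetical hook for querying the model under evaluation on a given item.

```python
# Illustrative sketch of psychometric adaptive testing (2PL IRT), not the paper's code.
import numpy as np

def p_correct(theta, a, b):
    """2PL probability that a model with ability theta answers an item correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fisher_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1.0 - p)

def estimate_theta(responses, a, b, grid=np.linspace(-4, 4, 161)):
    """Maximum-likelihood ability estimate over a grid, given binary responses."""
    log_lik = np.zeros_like(grid)
    for idx, y in responses:
        p = p_correct(grid, a[idx], b[idx])
        log_lik += y * np.log(p) + (1 - y) * np.log(1.0 - p)
    return grid[np.argmax(log_lik)]

def adaptive_test(answer_item, a, b, max_items=30):
    """Administer items one at a time by maximum information; return the ability estimate."""
    responses, administered = [], set()
    theta = 0.0  # start from an average-ability prior
    for _ in range(max_items):
        candidates = [i for i in range(len(a)) if i not in administered]
        next_item = max(candidates, key=lambda i: fisher_information(theta, a[i], b[i]))
        y = answer_item(next_item)  # 1 if the evaluated model answers correctly, else 0
        responses.append((next_item, y))
        administered.add(next_item)
        theta = estimate_theta(responses, a, b)
    return theta

# Usage (simulated): item parameters and responses are synthetic stand-ins for a
# calibrated benchmark and a model with true ability 1.2.
# rng = np.random.default_rng(0)
# a, b = rng.uniform(0.5, 2.0, 500), rng.normal(0.0, 1.0, 500)
# theta_hat = adaptive_test(lambda i: int(rng.random() < p_correct(1.2, a[i], b[i])), a, b)
```

In this sketch, only the most informative items for a given model are administered, which is how adaptive testing can reduce evaluation cost while still recovering a robust estimate of the latent ability.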
Primary Area: Research Priorities, Methodology, and Evaluation
Keywords: AI Evaluation, Adaptive Testing, Benchmark, Psychometrics
Submission Number: 43