Position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead

Published: 30 Apr 2026, Last Modified: 24 Jun 2026 | ICML 2026 Position Paper Track, regular | CC BY 4.0
TL;DR: We show and argue why human tests do not transfer to AI, and we propose a process for creating valid measurement instruments for AI instead.
Abstract: Large Language Models (LLMs) have achieved remarkable results on a range of standardized tests originally designed to assess human cognitive and psychological traits, such as intelligence and personality. While these results are often interpreted as strong evidence of human-like characteristics in LLMs, this paper argues that such interpretations constitute an ontological error. Human psychological and educational tests are theory-driven measurement instruments, calibrated to a specific population. Applying these tests to non-human subjects without empirical validation risks mischaracterizing what is being measured. Furthermore, a growing trend frames AI performance on benchmarks as a measurement of traits such as "intelligence", despite known issues with validity, data contamination, cultural bias, and sensitivity to superficial prompt changes. We argue that interpreting benchmark performance as a measurement of human-like traits lacks sufficient theoretical and empirical justification. This leads to our position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead. We call for the development of principled, AI-specific tests by laying out, end-to-end, how valid measurement instruments are constructed and validated, and where the ontological error enters when a human-calibrated instrument is applied to LLMs.
Lay Summary: When AI systems like ChatGPT, Gemini and Claude score well on IQ tests, GRE exams or personality questionnaires, headlines often declare that AI is becoming more human-like. But should we take those scores at face value? In this paper, we argue: no. Tests designed to measure human intelligence or personality are built with humans in mind: they are carefully calibrated using human data, grounded in theories of human cognition, and validated against human behavior. Handing those same tests to an AI is like putting a heart rate monitor on a robot’s arm. The device may show a number, but we cannot interpret it as the robot’s heart rate, nor conclude that the robot is dead if the device shows 0. This leads to our position: Stop Evaluating AI with Human Tests, Develop Principled, AI-specific Tests instead, with clear theoretical foundations, rigorous validation, and honest accounting of what is and isn't being measured. We show, end-to-end, how valid measurement instruments are constructed and validated, and where the category error enters when a human-calibrated test is applied to AI.
Primary Area: Research Priorities, Methodology, and Evaluation
Keywords: Evaluation, Benchmarking, Measurement, Psychometrics
Submission Number: 359