Keywords: AI Evaluation, Psychometrics, Large Language Models (LLMs), Cross-Cultural AI, Benchmark Calibration
TL;DR: To redefine "human-level" AI, we calibrate benchmarks on a difficulty scale $L = -\log_{10} p_W$, with $p_W$ estimated via LLMs from biased samples and validated by group slicing and post-stratification, yielding standard AI capability measures relative to the global population.
Abstract: What does ``human-level'' mean when model scores come from heterogeneous benchmarks? Or when the human data comes from a W.E.I.R.D. distribution?
Can we place AI on a more comprehensive human-referenced scale?
To make progress on this question, we first work with ratio scales of difficulty in $L$-units, derived from the probability of success $p_W$ of the whole world population on each item.
Each level is defined by $L = -\log_{B} p_W$ (so $L=0$ $\approx$ near-universal success, $L=1$ $\approx$ 1-in-$B$, $L=2$ $\approx$ 1-in-$B^2$, etc.).
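As a worked illustration (taking $B=10$, the base used in the TL;DR), an item that roughly one person in a hundred worldwide answers correctly sits at level two: $p_W = 0.01 \Rightarrow L = -\log_{10} 0.01 = 2$.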
We then compile publicly released test items spanning education and reasoning benchmarks (PISA, TIMSS, ICAR, PR, and ReliabilityBench), annotating each item with the capability scales it involves (reasoning, attention, volume, etc.).
We estimate $B$ and the location of anchor questions by extrapolating from a biased source sample (characterized by its demographics and other known information about how it was obtained) to a larger target population (with a new demographic profile) using LLMs, under the hypothesis that they condense vast amounts of demographic data during their training.
We explore different prompting mechanisms and ways to specify the source and target distributions, and evaluate their quality via group slicing on some of the datasets and via post-stratification.
The techniques introduced here allow us to define calibrated scales with which AI measurements can be standardized relative to the world population, and enable scalable `equating' of human populations in the social sciences.
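To convey the post-stratification idea, here is a minimal sketch under simplifying assumptions (a single demographic attribute, known target-population shares, and hypothetical function and variable names not taken from the paper): group-level success rates from the biased source sample are reweighted by the target population's demographic shares to estimate $p_W$, which is then converted to $L$-units.

```python
# Minimal post-stratification sketch (illustrative; names are assumptions,
# not the paper's code). Given per-item success indicators from a biased
# source sample with known demographic groups, reweight group-level success
# rates by the target population's group shares to estimate p_W, then
# convert the result to difficulty in L-units.

import math
from collections import defaultdict

def estimate_p_w(responses, target_shares):
    """responses: list of (group, correct) pairs from the source sample.
    target_shares: dict mapping group -> share of the target population."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, correct in responses:
        totals[group] += 1
        hits[group] += int(correct)
    # Post-stratified estimate: weight each group's success rate by its
    # share in the target (e.g., world) population, not in the sample.
    return sum(share * (hits[g] / totals[g])
               for g, share in target_shares.items() if totals[g] > 0)

def l_units(p_w, base=10):
    """Difficulty in L-units: L = -log_B(p_W)."""
    return -math.log(p_w, base)

# Toy usage with made-up numbers:
responses = [("A", 1), ("A", 1), ("A", 0), ("B", 0), ("B", 1)]
shares = {"A": 0.3, "B": 0.7}        # hypothetical target demographics
p = estimate_p_w(responses, shares)  # 0.3*(2/3) + 0.7*(1/2) = 0.55
print(round(l_units(p), 3))          # ≈ 0.26
```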
Primary Area: interpretability and explainable AI
Submission Number: 20101