Keywords: AI Evaluation, Psychometrics, Large Language Models (LLMs), Cross-Cultural AI, Benchmark Calibration
TL;DR: To redefine "human-level" AI, we calibrate benchmarks on a difficulty scale $L = -\log_{10} p_W$, with $p_W$ estimated via LLMs from biased samples and validated by group slicing and post-stratification, yielding standard AI capability measures relative to the global population.
Abstract: What does ``human-level'' mean when model scores come from heterogeneous benchmarks? Or when the human data comes from a W.E.I.R.D. distribution?
Can we place AI on a more comprehensive human-referenced scale?
To make progress on this question, we first work with ratio scales of difficulty in $L$-units, derived from the probability of success $p_W$ of the whole world population on each item.
Each level is defined by $L = -\log_{B} p_W$ (so $L=0$ $\approx$ near-universal success, $L=1$ $\approx$ 1-in-$B$, $L=2$ $\approx$ 1-in-$B^2$, etc.).
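As a worked illustration (taking $B=10$, the base used in the TL;DR), an item that roughly one person in a hundred worldwide answers correctly sits at level two: $p_W = 0.01 \Rightarrow L = -\log_{10} 0.01 = 2$.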
We then compile publicly released test items spanning education and reasoning benchmarks (PISA, TIMSS, ICAR, PR, and ReliabilityBench), annotating each item with the capability scales it involves (reasoning, attention, volume, etc.).
We estimate $B$ and the location of anchor questions by extrapolating from a biased source sample (characterized by its demographics and other known information about how it was obtained) to a larger target population (with a new demographic profile) using LLMs, under the hypothesis that they condense vast amounts of demographic data during their training.
We explore different prompting mechanisms and ways to specify the source and target distributions, and evaluate their quality via group slicing on some of the datasets and via post-stratification.
The techniques introduced here allow us to define calibrated scales with which AI measurements can be standardized relative to the world population, and enable scalable `equating' of human populations in the social sciences.
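To convey the post-stratification idea, here is a minimal sketch under simplifying assumptions (a single demographic attribute, known target-population shares, and hypothetical function and variable names not taken from the paper): group-level success rates from the biased source sample are reweighted by the target population's demographic shares to estimate $p_W$, which is then converted to $L$-units.

```python
# Minimal post-stratification sketch (illustrative; names are assumptions,
# not the paper's code). Given per-item success indicators from a biased
# source sample with known demographic groups, reweight group-level success
# rates by the target population's group shares to estimate p_W, then
# convert the result to difficulty in L-units.

import math
from collections import defaultdict

def estimate_p_w(responses, target_shares):
    """responses: list of (group, correct) pairs from the source sample.
    target_shares: dict mapping group -> share of the target population."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, correct in responses:
        totals[group] += 1
        hits[group] += int(correct)
    # Post-stratified estimate: weight each group's success rate by its
    # share in the target (e.g., world) population, not in the sample.
    return sum(share * (hits[g] / totals[g])
               for g, share in target_shares.items() if totals[g] > 0)

def l_units(p_w, base=10):
    """Difficulty in L-units: L = -log_B(p_W)."""
    return -math.log(p_w, base)

# Toy usage with made-up numbers:
responses = [("A", 1), ("A", 1), ("A", 0), ("B", 0), ("B", 1)]
shares = {"A": 0.3, "B": 0.7}        # hypothetical target demographics
p = estimate_p_w(responses, shares)  # 0.3*(2/3) + 0.7*(1/2) = 0.55
print(round(l_units(p), 3))          # ≈ 0.26
```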
Primary Area: interpretability and explainable AI
Submission Number: 20101