Beyond Mean Scores: Factor Models for Reliable and Efficient AI Evaluation

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: AI evaluation, Benchmarking, factor models, large language models
TL;DR: Mean scores can obscure model abilities. Factor models reveal distinct capability clusters and enable more reliable and efficient LLM evaluation.
Abstract: Generative AI evaluation relies heavily on benchmark leaderboards, which rank models according to their mean score on a given benchmark. In this paper, we show that these one-number metrics often obscure the multidimensional structure of model capabilities. We propose a factor model approach that decomposes model-item performance into interpretable latent constructs. We apply this approach to a novel dataset constructed from the Hugging Face Open LLM Leaderboard, containing item responses from 4,416 language models evaluated across 21,176 questions from six benchmarks. Our analysis reveals two key findings. (i) Benchmarks contain distinct, sometimes negatively correlated constructs that mean scores conflate: models with identical averages can excel at entirely different capabilities. This makes mean scores uninformative, or even misleading, measures of model capability, and we propose disaggregated alternatives based on the factor structure. (ii) The factor structure enables efficient estimation of both full-benchmark and disaggregated factor-level mean scores. By identifying the most informative questions, we can reduce evaluation costs while preserving model rankings. These results establish factor models as a principled framework for understanding benchmark structure, diagnosing when aggregation obscures meaningful differences, and enabling adaptive evaluation that maximizes information per question.
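To illustrate the general idea of decomposing a models-by-items response matrix into latent factors, here is a minimal sketch, not the authors' implementation: it uses simulated binary responses and scikit-learn's generic FactorAnalysis, whereas the paper presumably fits a model tailored to item-level benchmark data. All names and dimensions below are hypothetical.

```python
# Minimal sketch (assumption: generic exploratory factor analysis stands in
# for the paper's factor model). Rows = models, columns = benchmark items.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Hypothetical binary response matrix (1 = correct, 0 = incorrect);
# real data would come from leaderboard item-level logs.
n_models, n_items, n_factors = 200, 500, 2
latent = rng.normal(size=(n_models, n_factors))      # latent model abilities
loadings = rng.normal(size=(n_factors, n_items))     # item loadings
responses = (latent @ loadings + rng.normal(size=(n_models, n_items)) > 0).astype(float)

# Fit a two-factor model and inspect per-model factor scores.
fa = FactorAnalysis(n_components=n_factors, random_state=0)
scores = fa.fit_transform(responses)   # (n_models, n_factors): disaggregated capability estimates
print(scores[:5])
print(fa.components_.shape)            # (n_factors, n_items): item loadings
```

In this kind of setup, the per-model factor scores play the role of the disaggregated capability measures, and items with large loading magnitudes are the more informative questions one might prioritize to cut evaluation cost.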
Primary Area: datasets and benchmarks
Submission Number: 23180