Measurement to Meaning: A Validity-Centered Framework for AI Evaluation

Published: 24 Sept 2025, Last Modified: 24 Sept 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop (Poster)
License: CC BY 4.0
Keywords: Evaluations, Measurement Theory, Validity
TL;DR: A validity-centered framework for AI evaluation
Abstract: Despite rapid advances in AI, evaluation methods have not kept pace, and grandiose claims about general capabilities often rest on narrow benchmark performance. The result is a misleading picture of an AI system's true capabilities. To address this gap, this paper introduces a structured framework, grounded in principles from measurement theory, that more rigorously connects evaluation evidence to the claims it is meant to support. The framework helps reason about questions such as whether strong math performance indicates broad reasoning ability or merely math test-taking skill. By scrutinizing the validity of claims derived from evaluations, the framework supports better decision-making; it is demonstrated through detailed case studies on vision and language models.
Submission Number: 68