Measurement to Meaning: A Validity-Centered Framework for AI Evaluation

Published: 24 Sept 2025, Last Modified: 24 Sept 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop (Poster)
License: CC BY 4.0
Keywords: Evaluations, Measurement Theory, Validity
TL;DR: A validity-centered framework for AI evaluation
Abstract: Despite rapid advances in AI, evaluation methods have not kept pace, and grandiose claims about general capabilities often rest on narrow benchmark performance. The result is a misleading picture of an AI system's true capabilities. To address this gap, this paper introduces a structured framework, grounded in principles from measurement theory, that more rigorously connects evaluation evidence to the claims it is meant to support. The framework helps reason about questions such as whether strong math performance indicates broad reasoning ability or merely math test-taking skill. By scrutinizing the validity of claims derived from evaluations, the framework supports better decision-making; it is demonstrated through detailed case studies on vision and language models.
Submission Number: 68