Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Position Paper Track (poster) · CC BY 4.0
TL;DR: We argue that evaluating GenAI systems is a social science measurement challenge. As a result, the ML community would benefit from learning from and drawing on the social sciences when evaluating GenAI systems.
Abstract: The measurement tasks involved in evaluating generative AI (GenAI) systems lack sufficient scientific rigor, leading to what has been described as "a tangle of sloppy tests [and] apples-to-oranges comparisons" (Roose, 2024). In this position paper, we argue that the ML community would benefit from learning from and drawing on the social sciences when developing and using measurement instruments for evaluating GenAI systems. Specifically, our position is that evaluating GenAI systems is a social science measurement challenge. We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of GenAI systems. This framework has two important implications: First, it can broaden the expertise involved in evaluating GenAI systems by enabling stakeholders with different perspectives to participate in conceptual debates. Second, it brings rigor to both conceptual and operational debates by offering a set of lenses for interrogating validity.
Lay Summary: We've all seen news headlines claiming that GenAI systems can diagnose illnesses, solve difficult math problems, and write code. We've also seen coverage of risks like memorizing copyrighted data and generating harmful content. But what's the evidence behind these claims? And should we trust it? Much of the evidence comes from "GenAI evaluations." However, current evaluation practices lack sufficient scientific rigor. A key challenge is that the concepts of interest—like "diagnostic ability," "memorization," and "harmful content"—are much more abstract than the concepts—like prediction accuracy—that underpinned ML evaluations before the GenAI era. Indeed, these new concepts are much more reminiscent of the abstract, contested concepts studied in the social sciences, such as democracy in political science and personality traits in psychometrics. We describe how adopting a variant of the framework that social scientists use for measuring abstract, contested concepts can improve the scientific rigor of GenAI evaluations. A key part of this framework is clearly defining *what* will be measured and *why* separately from implementation decisions about *how* it will be measured. Separating the what and the why from the how allows us to meaningfully interrogate the validity of evaluations. We can then spot, for example, when a concept like memorization is defined in a way that is misaligned with the definitions most relevant to assessing copyright infringement, or when two benchmarks for measuring the stereotyping behaviors of LLMs implement very different understandings of those behaviors. The framework we propose makes it easier to identify and avoid poor evaluations, thereby forming an important step toward maturing current evaluation practices into a rigorous science of GenAI evaluations.
Primary Area: Research Priorities, Methodology, and Evaluation
Keywords: Generative AI, Capabilities, Behaviors, Impacts, Evaluation, Measurement, Measurement Theory, Social Sciences, Validity
Submission Number: 183