Position: Evaluating Generative AI Systems Is a Social Science Measurement Challenge

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 Position Paper Track (poster) · CC BY 4.0
TL;DR: We argue that evaluating GenAI systems is a social science measurement challenge. As a result, the ML community would benefit from learning from and drawing on the social sciences when evaluating GenAI systems.
Abstract: The measurement tasks involved in evaluating generative AI (GenAI) systems lack sufficient scientific rigor, leading to what has been described as "a tangle of sloppy tests [and] apples-to-oranges comparisons" (Roose, 2024). In this position paper, we argue that the ML community would benefit from learning from and drawing on the social sciences when developing and using measurement instruments for evaluating GenAI systems. Specifically, our position is that evaluating GenAI systems is a social science measurement challenge. We present a four-level framework, grounded in measurement theory from the social sciences, for measuring concepts related to the capabilities, behaviors, and impacts of GenAI systems. This framework has two important implications: First, it can broaden the expertise involved in evaluating GenAI systems by enabling stakeholders with different perspectives to participate in conceptual debates. Second, it brings rigor to both conceptual and operational debates by offering a set of lenses for interrogating validity.
Lay Summary: We've all seen news headlines claiming that GenAI systems can diagnose illnesses, solve difficult math problems, and write code. We've also seen coverage of risks like memorizing copyrighted data and generating harmful content. But what's the evidence behind these claims? And should we trust it? Much of the evidence comes from "GenAI evaluations." However, current evaluation practices lack sufficient scientific rigor. A key challenge is that the concepts of interest—like "diagnostic ability," "memorization," and "harmful content"—are much more abstract than the concepts—like prediction accuracy—that underpinned ML evaluations before the GenAI era. Indeed, these new concepts are much more reminiscent of the abstract, contested concepts studied in the social sciences, such as democracy in political science and personality traits in psychometrics. We describe how adopting a variant of the framework that social scientists use for measuring abstract, contested concepts can improve the scientific rigor of GenAI evaluations. A key part of this framework is clearly defining *what* will be measured and *why* separately from implementation decisions about *how* it will be measured. Separating the what and the why from the how allows us to meaningfully interrogate the validity of evaluations. We can then spot, for example, when a concept like memorization is defined in a way that is misaligned with the definitions most relevant to assessing copyright infringement, or when two benchmarks for measuring the stereotyping behaviors of LLMs implement very different understandings of those behaviors. The framework we propose makes it easier to identify and avoid poor evaluations, thereby forming an important step toward maturing current evaluation practices into a rigorous science of GenAI evaluations.
Primary Area: Research Priorities, Methodology, and Evaluation
Keywords: Generative AI, Capabilities, Behaviors, Impacts, Evaluation, Measurement, Measurement Theory, Social Sciences, Validity
Submission Number: 183