Research on medical large language models (LLMs) often makes bold claims, from encoding clinical knowledge to reasoning like a physician. These claims are usually backed by evaluation on competitive benchmarks—a tradition inherited from mainstream machine learning. But how do we separate real progress from a leaderboard flex? Medical LLM benchmarks, much like those in other fields, are arbitrarily constructed from medical licensing exam questions. For these benchmarks to truly measure progress, they must accurately capture the real-world tasks they aim to represent. In this position paper, we argue that medical LLM benchmarks should—and indeed can—be empirically evaluated for their construct validity. In the psychological testing literature, “construct validity” refers to the ability of a test to measure an underlying “construct”, that is, the actual conceptual target of evaluation. By drawing an analogy between LLM benchmarks and psychological tests, we explain how frameworks from this field can provide empirical foundations for validating benchmarks. To put these ideas into practice, we use real-world clinical data in proof-of-concept experiments to evaluate popular medical LLM benchmarks, and we report significant gaps in their construct validity. Finally, we outline a vision for a new ecosystem of medical LLM evaluation centered on the creation of valid benchmarks.
Many studies claim that medical large language models (LLMs) are highly capable---often based on how well they do on multiple-choice, exam-style tests. In this paper, we argue that those test scores don't truly reflect the messy, complicated reality of taking care of real patients in real hospitals. This problem isn't limited to medical LLMs. In general, we tend to treat these advanced models as if they were intelligent "agents" that can manifest some latent "capabilities" in open-ended tasks. Yet we still test them the same way we test simpler, narrower models, such as image classifiers. We draw parallels between the "capabilities" of LLMs and psychological traits such as intelligence---both are latent, complex constructs that cannot be directly observed but manifest in multifaceted ways through the ability to perform certain tasks. Based on this analogy, we suggest borrowing a concept from psychology known as "construct validity"---the idea that a test should actually measure the skill it claims to---as a foundational principle for evaluating and designing benchmarks for LLMs. We applied empirical tools for evaluating construct validity, inspired by the psychometrics literature, to medical LLM benchmarks, and found that even models with top scores on popular benchmarks often didn't do well when working with real patient records. We propose a vision for a "benchmark-validation-first" culture of model evaluation, in which the construct validity of benchmarks is assessed using real hospital data before those benchmarks are used to judge model quality. That way, we can evaluate medical LLMs based on what actually matters in clinical care—not just how well they answer test questions.
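To make the idea of an empirical construct-validity check concrete, here is a minimal, hypothetical sketch (not taken from the paper): it asks whether models' scores on an exam-style benchmark even rank-order the same models' performance on a task built from real patient records, a simple convergent-validity probe. All scores below are invented for illustration; the only library call used, `scipy.stats.spearmanr`, is real.

```python
# Hypothetical illustration of a simple construct-validity check:
# if a benchmark measures "clinical capability", models' benchmark scores
# should track their performance on a real-world clinical task.
from scipy.stats import spearmanr

# Accuracy of several (made-up) models on an exam-style benchmark
benchmark_scores = [0.86, 0.81, 0.78, 0.74, 0.69]
# The same models' (made-up) performance on a task built from real patient records
real_world_scores = [0.52, 0.61, 0.48, 0.55, 0.50]

# A high rank correlation would be weak evidence of convergent validity;
# a low one suggests the benchmark may not measure the intended construct.
rho, p_value = spearmanr(benchmark_scores, real_world_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A single correlation is of course only one ingredient of construct validity; the point of the sketch is that such checks require paired measurements on real clinical data, not leaderboard scores alone.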