Research on medical large language models (LLMs) often makes bold claims, from encoding clinical knowledge to reasoning like a physician. These claims are usually backed by evaluation on competitive benchmarks—a tradition inherited from mainstream machine learning. But how do we separate real progress from a leaderboard flex? Medical LLM benchmarks, much like those in other fields, are arbitrarily constructed from medical licensing exam questions. For these benchmarks to truly measure progress, they must accurately capture the real-world tasks they aim to represent. In this position paper, we argue that medical LLM benchmarks should—and indeed can—be empirically evaluated for their construct validity. In the psychological testing literature, “construct validity” refers to the ability of a test to measure an underlying “construct”, that is, the actual conceptual target of evaluation. By drawing an analogy between LLM benchmarks and psychological tests, we explain how frameworks from this field can provide empirical foundations for validating benchmarks. To put these ideas into practice, we use real-world clinical data in proof-of-concept experiments to evaluate popular medical LLM benchmarks, and we report significant gaps in their construct validity. Finally, we outline a vision for a new ecosystem of medical LLM evaluation centered on the creation of valid benchmarks.
Many studies claim that medical large language models (LLMs) are highly capable---often based on how well they do on multiple-choice, exam-style tests. In this paper, we argue that those test scores don't truly reflect the messy, complicated reality of taking care of real patients in real hospitals. This problem isn't limited to medical LLMs. In general, we tend to treat these advanced models as if they were intelligent "agents" that can manifest some latent "capabilities" in open-ended tasks. Yet we still test them the same way we test simpler, narrower models, such as image classifiers. We draw parallels between the "capabilities" of LLMs and psychological traits such as intelligence---both are latent, complex constructs that cannot be directly observed but manifest in multifaceted ways through the ability to perform certain tasks. Based on this analogy, we suggest borrowing a concept from psychology known as "construct validity"---the idea that a test should actually measure the skill it claims to---as a foundational principle for evaluating and designing benchmarks for LLMs. We applied empirical tools for evaluating construct validity, inspired by the psychometrics literature, to medical LLM benchmarks, and found that even models with top scores on popular benchmarks often didn't do well when working with real patient records. We propose a vision for a "benchmark-validation-first" culture of model evaluation, in which the construct validity of benchmarks is assessed using real hospital data before those benchmarks are used to judge model quality. That way, we can evaluate medical LLMs based on what actually matters in clinical care—not just how well they answer test questions.
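To make the idea of an empirical construct-validity check concrete, here is a minimal, hypothetical sketch (not taken from the paper): it asks whether models' scores on an exam-style benchmark even rank-order the same models' performance on a task built from real patient records, a simple convergent-validity probe. All scores below are invented for illustration; the only library call used, `scipy.stats.spearmanr`, is real.

```python
# Hypothetical illustration of a simple construct-validity check:
# if a benchmark measures "clinical capability", models' benchmark scores
# should track their performance on a real-world clinical task.
from scipy.stats import spearmanr

# Accuracy of several (made-up) models on an exam-style benchmark
benchmark_scores = [0.86, 0.81, 0.78, 0.74, 0.69]
# The same models' (made-up) performance on a task built from real patient records
real_world_scores = [0.52, 0.61, 0.48, 0.55, 0.50]

# A high rank correlation would be weak evidence of convergent validity;
# a low one suggests the benchmark may not measure the intended construct.
rho, p_value = spearmanr(benchmark_scores, real_world_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```

A single correlation is of course only one ingredient of construct validity; the point of the sketch is that such checks require paired measurements on real clinical data, not leaderboard scores alone.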