Keywords: medical text validation, clinical NLP, LLM-as-judge, self-supervised learning
TL;DR: Our research provides evidence of language models approaching expert-level ability in validating AI-generated medical text.
Abstract: With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. To address these challenges, we propose MedVAL, a novel, self-supervised, data-efficient distillation method that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with their inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset of 840 physician-annotated outputs across 6 diverse medical tasks capturing real-world challenges. Across 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL distillation significantly improves ($p < 0.001$) alignment with physicians across seen and unseen tasks, increasing average F1 scores from 66% to 83%. Despite strong baseline performance, MedVAL improves the best-performing proprietary LM (GPT-4o) by 8% without training on physician-labeled data, achieving performance statistically non-inferior to that of a single human expert ($p < 0.001$). Our benchmark provides evidence of LMs approaching expert-level ability in validating AI-generated medical text.
Submission Number: 83