MedExpert: An Expert-Annotated Dataset for Medical Chatbot Evaluation

Mahsa Yarmohammadi, Alexandra DeLucia, Lillian C. Chen, Leslie Miller, Heyuan Huang, Sonal Joshi, Jonathan Lasko, Sarah Collica, Ryan Moore, Haoling Qiu, Peter P Zandi, Damianos Karakos, Mark Dredze

Published: 27 Nov 2025, Last Modified: 09 Dec 2025
Venue: ML4H 2025 Poster
License: CC BY 4.0
Keywords: Dataset, Mental health, Prenatal care, Benchmark datasets, Data-centric AI, Factuality, Hallucination, Omission, Completeness, Long-form medical question-answer, Large language models
Track: Proceedings
Abstract: Large language models (LLMs) can create compelling patient-facing medical chatbots, but their reliability in clinical settings remains a concern because their responses may be inaccurate. To better evaluate patient-facing LLM generations, we introduce MedExpert, a comprehensive dataset of clinician-created questions and annotations for assessing the accuracy and reliability of LLM-generated medical responses. MedExpert comprises 540 question–response pairs in two specialties, young adult mental health and prenatal care, each annotated by clinical subject-matter experts for aspects such as factual accuracy and completeness. The dataset provides a framework for exploring these issues in medical chatbots and for evaluating automatic error detection systems in these domains.
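As a rough illustration of the kind of record the abstract describes, a minimal sketch of one annotated question–response pair is shown below. The field names and example text are hypothetical assumptions for illustration only; the dataset is not publicly released and no schema is given in the abstract.

```python
# Hypothetical sketch of a single MedExpert-style record.
# Field names and contents are illustrative assumptions, not the actual schema.
record = {
    "specialty": "prenatal care",  # or "young adult mental health"
    "question": "Is it safe to take ibuprofen in the third trimester?",
    "llm_response": "Ibuprofen is generally safe throughout pregnancy...",
    "expert_annotations": {
        # spans the clinical expert marked as factually inaccurate
        "factual_errors": [
            {
                "span": "generally safe throughout pregnancy",
                "note": "NSAIDs are typically avoided in the third trimester",
            }
        ],
        # information the expert judged to be missing from the response
        "omissions": ["does not advise consulting a clinician"],
        "completeness": "incomplete",
    },
}
```

Under this sketch, an automatic error detection system could be scored against the expert labels, for example by measuring whether it flags the same spans the clinicians marked as factual errors or omissions.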
General Area: Applications and Practice
Specific Subject Areas: Dataset Release & Characterization, Natural Language Processing, Foundation Models, Public & Social Health, Evaluation Methods & Validity
Data And Code Availability: No
Ethics Board Approval: Yes
Entered Conflicts: I confirm the above
Anonymity: I confirm the above
Submission Number: 121