MedExpert: An Expert-Annotated Dataset for Medical Chatbot Evaluation

Mahsa Yarmohammadi, Alexandra DeLucia, Lillian C. Chen, Leslie Miller, Heyuan Huang, Sonal Joshi, Jonathan Lasko, Sarah Collica, Ryan Moore, Haoling Qiu, Peter P Zandi, Damianos Karakos, Mark Dredze

Published: 27 Nov 2025, Last Modified: 09 Dec 2025
Venue: ML4H 2025 Poster
License: CC BY 4.0
Keywords: Dataset, Mental health, Prenatal care, Benchmark datasets, Data-centric AI, Factuality, Hallucination, Omission, Completeness, Long-form medical question-answer, Large language models
Track: Proceedings
Abstract: Large language models (LLMs) can create compelling patient-facing medical chatbots, but their reliability in clinical settings remains a concern because their responses may be inaccurate. To better evaluate patient-facing LLM generations, we introduce MedExpert, a comprehensive dataset of clinician-created questions and annotations for assessing the accuracy and reliability of LLM-generated medical responses. MedExpert comprises 540 question–response pairs in two specialties, young adult mental health and prenatal care, each annotated by clinical subject-matter experts for aspects such as factual accuracy and completeness. The dataset provides a framework for exploring these issues in medical chatbots and for evaluating automatic error detection systems in these domains.
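As a rough illustration of the kind of record the abstract describes, a minimal sketch of one annotated question–response pair is shown below. The field names and example text are hypothetical assumptions for illustration only; the dataset is not publicly released and no schema is given in the abstract.

```python
# Hypothetical sketch of a single MedExpert-style record.
# Field names and contents are illustrative assumptions, not the actual schema.
record = {
    "specialty": "prenatal care",  # or "young adult mental health"
    "question": "Is it safe to take ibuprofen in the third trimester?",
    "llm_response": "Ibuprofen is generally safe throughout pregnancy...",
    "expert_annotations": {
        # spans the clinical expert marked as factually inaccurate
        "factual_errors": [
            {
                "span": "generally safe throughout pregnancy",
                "note": "NSAIDs are typically avoided in the third trimester",
            }
        ],
        # information the expert judged to be missing from the response
        "omissions": ["does not advise consulting a clinician"],
        "completeness": "incomplete",
    },
}
```

Under this sketch, an automatic error detection system could be scored against the expert labels, for example by measuring whether it flags the same spans the clinicians marked as factual errors or omissions.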
General Area: Applications and Practice
Specific Subject Areas: Dataset Release & Characterization, Natural Language Processing, Foundation Models, Public & Social Health, Evaluation Methods & Validity
Data And Code Availability: No
Ethics Board Approval: Yes
Entered Conflicts: I confirm the above
Anonymity: I confirm the above
Submission Number: 121