Keywords: Dataset, Mental health, Prenatal care, Benchmark datasets, Data-centric AI, Factuality, Hallucination, Omission, Completeness, Long-form medical question-answer, Large language models
Track: Proceedings
Abstract: Large language models (LLMs) can create compelling patient-facing medical chatbots, but their reliability in clinical settings remains a concern because their responses may be inaccurate. To better evaluate patient-facing LLM generations, we introduce MedExpert, a comprehensive dataset featuring clinician-created questions and annotations to assess the accuracy and reliability of LLM-generated medical responses. MedExpert comprises 540 question–response pairs in two specialties—young adult mental health and prenatal care—each annotated by clinical subject-matter experts for aspects such as factual accuracy and completeness. The dataset provides a framework for exploring these issues in medical chatbots and for evaluating automatic error detection systems in these domains.
General Area: Applications and Practice
Specific Subject Areas: Dataset Release & Characterization, Natural Language Processing, Foundation Models, Public & Social Health, Evaluation Methods & Validity
Data And Code Availability: No
Ethics Board Approval: Yes
Entered Conflicts: I confirm the above
Anonymity: I confirm the above
Submission Number: 121