Keywords: Medical QA, Benchmarking, LLMs, Medical AI, Human Evaluation
Abstract: There is a lack of benchmarks for evaluating large language models (LLMs) on long-form medical question answering (QA). Most existing benchmarks for medical QA evaluation focus on automatic metrics and multiple-choice questions. While valuable, these benchmarks do not fully capture or assess the complexities of the real-world clinical settings where LLMs are being deployed. Furthermore, the few existing studies of long-form answer generation in medical QA are primarily closed-source, providing no access to human medical expert annotations, which makes it difficult to reproduce results and improve upon baselines. In this work, we introduce a new publicly available benchmark featuring real-world consumer medical questions with long-form answer evaluations annotated by medical doctors. We conduct pairwise comparisons of responses from various open and closed medical and general-purpose LLMs based on criteria such as correctness, helpfulness, harmfulness, and bias. Additionally, we perform a comprehensive LLM-as-a-judge analysis to study the alignment between LLM and human expert judgments. Our preliminary results highlight the strong potential of open LLMs in medical QA compared to leading closed models.
Supplementary Material: zip
Submission Number: 95