Keywords: medical LLM evaluation, clinical dialogue, multi-turn question answering, persona-aware evaluation, reliability and fairness in medical NLP
Abstract: Medical benchmarks are dominated by single-turn, multiple-choice clinical cases that poorly reflect real consultations. In practice, clinicians elicit evidence interactively, and patient communication styles vary widely. We introduce MEDROUNDSQA, a multi-turn diagnostic benchmark derived from 1,387 board exam cases across 17 specialties. Each case is converted into a structured 24-slot clinical record and then instantiated as controlled doctor-patient dual-agent dialogues under varying patient personas, with the underlying clinical content held fixed. We further classify cases by difficulty using model-based uncertainty to enable easy-to-hard analysis. Evaluations of five LLM doctor agents show that (i) moving from single-turn diagnosis on the standardized records to multi-turn consultations causes large accuracy degradations of roughly 25–39%; (ii) additional turns reliably improve question relevance, but diagnostic accuracy exhibits diminishing returns and typically plateaus after 6–12 turns; and (iii) patient persona differences can shift diagnostic accuracy by about 7–8 points (lowest- to highest-education persona), highlighting equity risks that single-turn benchmarks miss.
Paper Type: Long
Research Area: Clinical and Biomedical Applications
Research Area Keywords: biomedical QA, healthcare applications, clinical NLP
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 2256