Keywords: medical LLM evaluation, clinical dialogue, multi-turn question answering, persona-aware evaluation, reliability and fairness in medical NLP
Abstract: Medical benchmarks are dominated by single-turn, multiple-choice clinical cases that poorly reflect real consultations. In practice, clinicians elicit evidence interactively, and patient communication styles vary widely. We introduce MEDROUNDSQA, a multi-turn diagnostic benchmark derived from 1,387 board exam cases across 17 specialties. Each case is converted into a structured 24-slot clinical record and then instantiated as controlled doctor-patient dual-agent dialogues under varying patient personas, with the underlying clinical content held fixed. We further classify cases by difficulty using model-based uncertainty to enable easy-to-hard analysis. Evaluations of five LLM doctor agents show that (i) moving from single-turn diagnosis on the standardized records to multi-turn consultations causes large accuracy degradations of roughly 25–39%; (ii) additional turns reliably improve question relevance, but diagnostic accuracy exhibits diminishing returns and typically plateaus after 6–12 turns; and (iii) patient persona differences can shift diagnostic accuracy by about 7–8 points (lowest- to highest-education persona), highlighting equity risks that single-turn benchmarks miss.
Paper Type: Long
Research Area: Clinical and Biomedical Applications
Research Area Keywords: biomedical QA, healthcare applications, clinical NLP
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 2256