Abstract: Current medical AI models are trained primarily on static articles and question-answering (QA) tasks, and are then evaluated on similar QA benchmarks. However, such approaches fail to capture the dynamic, real-world nature of clinical reasoning, particularly the handling of ambiguous inputs (e.g., conflicting symptoms) and multi-step decision-making. To address this, we: \ding{182} introduce \textbf{MuddyMaze}, a comprehensive diagnostic benchmark that evaluates clinical reasoning under controlled noise and USMLE-aligned difficulty levels; \ding{183} curate a new dialogue dataset by converting 10.2k medical QA pairs and 12k PubMed articles into clinician-patient interactions; and \ding{184} develop a dialogue-based fine-tuning method that enhances reasoning capabilities. Experiments demonstrate significant improvements over traditional methods (+16.10\% in one-round accuracy and +4.06\% in multi-round reasoning), validating that dialogue-based training better aligns AI systems with real clinical workflows.
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: Dialogue and Interactive Systems; Healthcare Applications; Clinical NLP
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models, Data resources
Languages Studied: English
Submission Number: 4401