Can Large Language Models Think Like Doctors? An Interactive Approach to Evaluating Clinical Reasoning

18 Sept 2025 (modified: 23 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Clinical Reasoning, Benchmark, LLM
TL;DR: iClinReason is an interactive framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues.
Abstract: Clinical diagnosis begins with doctor-patient interaction, during which physicians iteratively gather targeted information, order examinations, and refine their differential diagnosis based on patients' responses. This interactive clinical-reasoning process is poorly represented by existing LLM benchmarks, which focus on question-answering or multiple-choice formats. In this work, we propose iClinReason, an interactive framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues. Grounded in a disease knowledge graph, our method dynamically generates patient cases with structured symptom profiles and instantiates a patient agent that engages in a multi-turn diagnostic conversation with the target LLM, which acts as a doctor agent. Our evaluation protocol goes beyond diagnostic accuracy by incorporating fine-grained efficiency analysis and a rubric-based assessment of diagnostic quality across multiple dimensions. Experimental results reveal that iClinReason effectively exposes critical clinical-reasoning gaps in state-of-the-art LLMs, offering a more nuanced and clinically meaningful evaluation paradigm.
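The abstract's interaction protocol — a patient agent answering symptom queries from a structured profile while a doctor agent asks questions and then commits to a diagnosis — can be sketched as below. This is a minimal illustration, not the authors' implementation: the class names, the rule-based doctor, and the overlap-scoring heuristic are all assumptions (a real evaluation would back both agents with LLMs and choose questions adaptively).

```python
# Illustrative sketch of an iClinReason-style diagnostic dialogue loop.
# All names (PatientAgent, DoctorAgent, run_consultation) are hypothetical.

class PatientAgent:
    """Answers symptom queries from a structured case profile."""
    def __init__(self, profile):
        self.profile = profile  # e.g. {"fever": True, "cough": True}

    def answer(self, symptom):
        # Report whether the simulated patient has this symptom.
        return self.profile.get(symptom, False)


class DoctorAgent:
    """Toy stand-in for the evaluated LLM, grounded in a disease->symptoms map."""
    def __init__(self, disease_kb):
        self.kb = disease_kb

    def run_consultation(self, patient, max_turns=5):
        # Query symptoms turn by turn (an LLM would pick questions adaptively).
        findings = {}
        symptoms = sorted({s for syms in self.kb.values() for s in syms})
        for symptom in symptoms[:max_turns]:
            findings[symptom] = patient.answer(symptom)
        # Commit to the disease whose symptom set best matches the positives.
        positives = {s for s, present in findings.items() if present}
        diagnosis = max(
            self.kb,
            key=lambda d: len(self.kb[d] & positives) - len(self.kb[d] - positives),
        )
        # Return the turn count too, so efficiency can be scored alongside accuracy.
        return diagnosis, len(findings)


kb = {"flu": {"fever", "cough", "fatigue"}, "allergy": {"sneezing", "itchy_eyes"}}
patient = PatientAgent({"fever": True, "cough": True, "fatigue": True})
doctor = DoctorAgent(kb)
diagnosis, turns = doctor.run_consultation(patient)
```

Returning the turn count alongside the diagnosis mirrors the paper's point that accuracy alone is insufficient: two models reaching the same diagnosis may differ sharply in how many questions they needed.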
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 12044