Can Large Language Models Think Like Doctors? An Interactive Approach to Evaluating Clinical Reasoning

18 Sept 2025 (modified: 23 Nov 2025) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Clinical Reasoning, Benchmark, LLM
TL;DR: iClinReason is an interactive framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues.
Abstract: Clinical diagnosis begins with doctor-patient interaction, during which physicians iteratively gather targeted information, order examinations, and refine their differential diagnosis based on patients' responses. This interactive clinical-reasoning process is poorly represented by existing LLM benchmarks, which focus on question-answering or multiple-choice formats. In this work, we propose iClinReason, an interactive framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues. Grounded in a disease knowledge graph, our method dynamically generates patient cases with structured symptom profiles and instantiates a patient agent that engages in a multi-turn diagnostic conversation with the target LLM, which acts as a doctor agent. Our evaluation protocol goes beyond diagnostic accuracy by incorporating fine-grained efficiency analysis and a rubric-based assessment of diagnostic quality across multiple dimensions. Experimental results reveal that iClinReason effectively exposes critical clinical-reasoning gaps in state-of-the-art LLMs, offering a more nuanced and clinically meaningful evaluation paradigm.
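The abstract's interaction protocol — a patient agent answering symptom queries from a structured profile while a doctor agent asks questions and then commits to a diagnosis — can be sketched as below. This is a minimal illustration, not the authors' implementation: the class names, the rule-based doctor, and the overlap-scoring heuristic are all assumptions (a real evaluation would back both agents with LLMs and choose questions adaptively).

```python
# Illustrative sketch of an iClinReason-style diagnostic dialogue loop.
# All names (PatientAgent, DoctorAgent, run_consultation) are hypothetical.

class PatientAgent:
    """Answers symptom queries from a structured case profile."""
    def __init__(self, profile):
        self.profile = profile  # e.g. {"fever": True, "cough": True}

    def answer(self, symptom):
        # Report whether the simulated patient has this symptom.
        return self.profile.get(symptom, False)


class DoctorAgent:
    """Toy stand-in for the evaluated LLM, grounded in a disease->symptoms map."""
    def __init__(self, disease_kb):
        self.kb = disease_kb

    def run_consultation(self, patient, max_turns=5):
        # Query symptoms turn by turn (an LLM would pick questions adaptively).
        findings = {}
        symptoms = sorted({s for syms in self.kb.values() for s in syms})
        for symptom in symptoms[:max_turns]:
            findings[symptom] = patient.answer(symptom)
        # Commit to the disease whose symptom set best matches the positives.
        positives = {s for s, present in findings.items() if present}
        diagnosis = max(
            self.kb,
            key=lambda d: len(self.kb[d] & positives) - len(self.kb[d] - positives),
        )
        # Return the turn count too, so efficiency can be scored alongside accuracy.
        return diagnosis, len(findings)


kb = {"flu": {"fever", "cough", "fatigue"}, "allergy": {"sneezing", "itchy_eyes"}}
patient = PatientAgent({"fever": True, "cough": True, "fatigue": True})
doctor = DoctorAgent(kb)
diagnosis, turns = doctor.run_consultation(patient)
```

Returning the turn count alongside the diagnosis mirrors the paper's point that accuracy alone is insufficient: two models reaching the same diagnosis may differ sharply in how many questions they needed.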
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 12044