Keywords: LLM Reasoning, Medical Diagnosis
Abstract: In high-stakes domains like medicine, $\textbf{how}$ an AI arrives at an answer can be as critical as the answer itself. However, existing medical question answering benchmarks largely ignore the reasoning process, evaluating models only on final-answer accuracy. To address this gap, we introduce $\textbf{MedReason-Dx}$, a benchmark that assesses not just answers but the step-by-step reasoning behind them. MedReason-Dx provides expert-annotated, step-by-step solutions for both multiple-choice and open-ended questions spanning 24 medical specialties. By requiring models to produce, and be evaluated on, intermediate reasoning steps, the benchmark enables rigorous testing of interpretability and logical consistency in medical QA. We present the design of MedReason-Dx and outline evaluation metrics that reward faithful reasoning. We hope this resource will advance the development of robust, interpretable medical decision support systems and foster research into large language models that reason as well as they respond.
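The abstract mentions metrics that reward faithful reasoning without specifying them. For concreteness, below is a minimal, purely illustrative sketch of one way step-level evaluation against expert-annotated solutions could be scored: a fuzzy-matched step-level F1. The similarity matcher (`difflib.SequenceMatcher`), the match threshold, and the toy reasoning chains are all assumptions for illustration, not the metrics actually defined by MedReason-Dx.

```python
from difflib import SequenceMatcher

def step_match(pred: str, gold: str, threshold: float = 0.6) -> bool:
    """Treat a predicted step as matching a gold step when their string
    similarity exceeds a threshold. The threshold and matcher are
    illustrative assumptions; a real benchmark might use an expert-defined
    or model-based matcher instead."""
    return SequenceMatcher(None, pred.lower(), gold.lower()).ratio() >= threshold

def step_level_f1(pred_steps: list[str], gold_steps: list[str]) -> float:
    """Step-level F1: precision over predicted steps, recall over gold steps.
    Each gold step may be matched by at most one predicted step."""
    if not pred_steps or not gold_steps:
        return 0.0
    matched_gold: set[int] = set()
    tp = 0
    for p in pred_steps:
        for i, g in enumerate(gold_steps):
            if i not in matched_gold and step_match(p, g):
                matched_gold.add(i)
                tp += 1
                break
    precision = tp / len(pred_steps)
    recall = tp / len(gold_steps)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy example: a model's reasoning chain scored against an expert chain.
pred = ["ST elevation on the ECG is consistent with STEMI.",
        "Troponin is elevated."]
gold = ["ST elevation on ECG is consistent with STEMI.",
        "Troponin is elevated.",
        "Immediate reperfusion therapy is indicated."]
print(f"step-level F1: {step_level_f1(pred, gold):.2f}")
```

In this sketch, a model that reaches the right diagnosis while skipping gold reasoning steps is penalized on recall, which captures the abstract's goal of rewarding faithful intermediate reasoning rather than final answers alone.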
Primary Area: datasets and benchmarks
Submission Number: 11353