Keywords: Multi-Turn Instruction Following, Benchmark, Medical, LLM
TL;DR: We introduce MedMT-Bench, a challenging new benchmark for long medical conversations. We find that current SOTA LLMs fail severely at long-term memory, understanding, and safety, revealing critical risks for real-world deployment.
Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities across various specialist domains and have been integrated into high-stakes areas such as medicine. However, systematically evaluating their capabilities remains a significant challenge: existing medical benchmarks often focus on single-turn tasks or short dialogues and rarely stress-test the long-context memory, interference robustness, and safety defense required in practice. To bridge this gap, we introduce MedMT-Bench, a challenging medical multi-turn instruction-following benchmark that simulates the entire diagnosis and treatment process, spanning pre-diagnosis, in-diagnosis, and post-diagnosis stages. Motivated by practical problems observed in real-world deployments, MedMT-Bench operationalizes five core capabilities: 1) long-context memory and understanding; 2) resistance to contextual interference; 3) self-correction, affirmation, and safety defense; 4) instruction clarification; and 5) multi-instruction response under interference. We construct the benchmark via scene-by-scene data synthesis refined by manual expert editing, yielding 400 test cases with an average of 22 turns (maximum 52), covering 24 departments and 9 sub-scenarios, including a multimodal subset. For evaluation, we propose an LLM-as-judge protocol with instance-level rubrics and atomic test points, validated against expert annotations with a human-LLM agreement of 91.94%. We test 17 frontier models, all of which underperform on MedMT-Bench (overall accuracy below 60.00%), with the best model reaching 59.75%. MedMT-Bench can serve as an essential tool for driving future research towards safer and more reliable medical AI. The benchmark is available in the supplementary materials.
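To make the evaluation protocol concrete, below is a minimal sketch of how a rubric-based LLM-as-judge scorer with atomic test points might look. The rubric format, judge prompt, and scoring rule are assumptions for illustration; they are not the exact protocol defined in the paper.

```python
# Illustrative sketch only: a weighted pass-rate over atomic test points,
# judged by a separate LLM. Field names and the judge prompt are hypothetical.
from dataclasses import dataclass

@dataclass
class TestPoint:
    description: str   # one atomic requirement, e.g. "recalls the allergy stated in turn 3"
    weight: float = 1.0

def build_judge_prompt(dialogue: str, response: str, point: TestPoint) -> str:
    """Compose a yes/no judgment prompt for a single atomic test point."""
    return (
        "You are grading a medical assistant's reply.\n"
        f"Conversation so far:\n{dialogue}\n\n"
        f"Assistant reply:\n{response}\n\n"
        f"Requirement: {point.description}\n"
        "Answer strictly with YES or NO."
    )

def score_instance(dialogue: str, response: str,
                   rubric: list[TestPoint], call_judge) -> float:
    """Return the weighted fraction of test points the judge marks as satisfied.

    `call_judge` is any callable that sends a prompt to a judge LLM and
    returns its text output (hypothetical; substitute your own client).
    """
    total = sum(p.weight for p in rubric)
    passed = 0.0
    for point in rubric:
        verdict = call_judge(build_judge_prompt(dialogue, response, point))
        if verdict.strip().upper().startswith("YES"):
            passed += point.weight
    return passed / total if total else 0.0
```

Instance-level scores of this kind could then be averaged per capability and per model to produce the overall accuracy figures reported above.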
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 7667