Keywords: MultiModal Medical Benchmark
TL;DR: LiveClin is a live benchmark that evaluates medical LLMs across the entire clinical pathway.
Abstract: The reliability of medical LLM evaluation is critically undermined by data contamination and knowledge obsolescence, leading to inflated scores on static benchmarks. To address these challenges, we introduce LiveClin, a live benchmark designed for the faithful replication of clinical practice. Built from contemporary, peer-reviewed case reports and updated biannually, LiveClin ensures clinical currency and resists data contamination. Using a verified AI-human workflow involving 239 physicians, we transform authentic patient cases into complex, multimodal evaluation scenarios that span the entire clinical pathway. The benchmark currently comprises 1,407 case reports and 6,605 questions. Our evaluation of 26 models on LiveClin reveals the profound difficulty of these real-world scenarios, with the top-performing model achieving a Case Accuracy of just 35.7%. We find that the era of "free lunch" improvements from simple model scaling is over, as newer models do not consistently outperform their predecessors. Furthermore, our analysis uncovers distinct reasoning weaknesses across model classes. LiveClin thus provides a continuously evolving, clinically grounded framework to steer the development of medical LLMs towards greater reliability and real-world utility.
Primary Area: datasets and benchmarks
Submission Number: 19066