Abstract: Large Language Models (LLMs) have demonstrated impressive performance on existing medical reasoning benchmarks. This high performance makes it increasingly difficult to meaningfully evaluate and differentiate advanced methods. We present MedAgentsArena, a carefully curated benchmark that focuses on challenging medical questions where current models still struggle. Drawing from seven established medical datasets, our benchmark addresses three key limitations of existing evaluations: the prevalence of straightforward questions on which even base models achieve high performance, inconsistent sampling and evaluation protocols across studies, and the lack of systematic analysis of the interplay among performance, cost, and inference time. Through experiments with various base models and reasoning methods, we demonstrate that the latest thinking models, DeepSeek R1 and OpenAI o3, achieve exceptional performance on complex medical reasoning tasks. Advanced search-based agent methods are also effective at handling intricate medical queries. Our benchmark and evaluation framework are publicly available at https://anonymous.4open.science/r/MedAgents-Benchmark.
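The abstract describes curating the benchmark around questions where current models still struggle. As a minimal sketch of that idea (not the authors' released pipeline), one could retain only questions whose average accuracy across a set of base models falls below a threshold; the dataset fields, the 0.5 threshold, and the model callables here are illustrative assumptions.

```python
from typing import Callable, Dict, List


def filter_hard_questions(
    questions: List[Dict],
    base_models: List[Callable[[str], str]],
    max_accuracy: float = 0.5,
) -> List[Dict]:
    """Keep questions whose mean accuracy across base models is at or below a threshold."""
    hard = []
    for q in questions:
        # Count how many base models answer this question correctly.
        correct = sum(1 for model in base_models if model(q["question"]) == q["answer"])
        if correct / len(base_models) <= max_accuracy:
            hard.append(q)
    return hard
```

The threshold trades off difficulty against benchmark size: a lower value keeps only questions that nearly all base models miss, which sharpens differentiation between advanced methods at the cost of fewer evaluation items.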
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: NLP Applications, clinical NLP, medical NLP, Large Language Models
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 7901