Abstract: Large Language Models (LLMs) have demonstrated impressive performance on existing medical reasoning benchmarks. This high performance makes it increasingly difficult to meaningfully evaluate and differentiate advanced methods. We present MedAgentsArena, a carefully curated benchmark that focuses on challenging medical questions where current models still struggle. Drawing from seven established medical datasets, our benchmark addresses three key limitations of existing evaluations: the prevalence of straightforward questions on which even base models achieve high performance, inconsistent sampling and evaluation protocols across studies, and the lack of systematic analysis of the interplay among performance, cost, and inference time. Through experiments with various base models and reasoning methods, we demonstrate that the latest thinking models, DeepSeek R1 and OpenAI o3, achieve exceptional performance on complex medical reasoning tasks. Advanced search-based agent methods are also effective at handling intricate medical queries. Our benchmark and evaluation framework are publicly available at https://anonymous.4open.science/r/MedAgents-Benchmark.
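The abstract describes curating the benchmark around questions where current models still struggle. As a minimal sketch of that idea (not the authors' released pipeline), one could retain only questions whose average accuracy across a set of base models falls below a threshold; the dataset fields, the 0.5 threshold, and the model callables here are illustrative assumptions.

```python
from typing import Callable, Dict, List


def filter_hard_questions(
    questions: List[Dict],
    base_models: List[Callable[[str], str]],
    max_accuracy: float = 0.5,
) -> List[Dict]:
    """Keep questions whose mean accuracy across base models is at or below a threshold."""
    hard = []
    for q in questions:
        # Count how many base models answer this question correctly.
        correct = sum(1 for model in base_models if model(q["question"]) == q["answer"])
        if correct / len(base_models) <= max_accuracy:
            hard.append(q)
    return hard
```

The threshold trades off difficulty against benchmark size: a lower value keeps only questions that nearly all base models miss, which sharpens differentiation between advanced methods at the cost of fewer evaluation items.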
Paper Type: Short
Research Area: NLP Applications
Research Area Keywords: NLP Applications, clinical NLP, medical NLP, Large Language Models
Contribution Types: Model analysis & interpretability, Data resources, Data analysis
Languages Studied: English
Submission Number: 7901