Rethinking MedAgentBench: A Framework for Fair Medical LLM Agent Evaluation

Ananya Mantravadi; Prasanna Desikan; Abhishek Mukherji

Published: 23 May 2026, Last Modified: 23 May 2026. ACM CAIS 2026: RLEval Workshop Poster. CC BY 4.0

Keywords: Clinical AI, LLM, Evaluations, Benchmarks, Healthcare

TL;DR: We audit the MedAgentBench benchmark for medical LLM agents, show that a do-nothing agent scores 42% on MedAgentBench v2, fix four benchmark failures, and release MAB-v3 with corrected scores.

Abstract: MedAgentBench (MAB) v1 and v2, originally published by Stanford University, are the primary benchmarks for evaluating large language models (LLMs) that query patient records through FHIR (Fast Healthcare Interoperability Resources) APIs and execute clinical orders. We audit both benchmarks and identify four evaluation failures that, together, allow a do-nothing agent to score 42% on v2 before any clinical reasoning is demonstrated. First, branch imbalance: four v2 task types have 70–97% of instances on the no-action branch due to cohort composition. Second, the silent-finish ceiling, the measurable consequence of this imbalance: 41.7% of v2 tasks pass when an agent returns an empty result with no tool use (versus only 5.3% on v1), setting a floor against which any reported score must be read. Third, undocumented format requirements: graders for v1 task 5, v1 task 9, and v2 task 3 enforce format conventions absent from the task context, causing systematic 0% pass rates on clinically correct responses. Fourth, a wall-clock bug: the v2-T1 grader anchors the 12-month CT follow-up window to real time rather than the scenario date, so 4 of 30 patients who should require no new scan are misclassified as action-required when the benchmark is run in 2025–2026, breaking reproducibility. We construct MedAgentBench-v3, a corrected 508-task benchmark with fixed graders, a frozen timestamp, and 1:1 action/no-action balance. The do-nothing ceiling falls from 41.7% to 8.9%; on MAB-v3, frontier models score 50–79% overall (27–70 pp net above the do-nothing baseline), with substantial divergence between action-branch and no-action-branch pass rates revealing calibration differences invisible in aggregate scores.
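
The wall-clock bug described in the abstract can be illustrated with a minimal sketch. This is not the MedAgentBench grader code: the function name `needs_new_ct`, the 365-day approximation of the 12-month window, and both example dates are assumptions chosen for illustration only. The sketch shows how the same patient record yields a different label depending on whether the follow-up window is anchored to a frozen scenario date or to the day the benchmark happens to be run.

```python
from datetime import date, timedelta

# Hypothetical sketch; names and dates are illustrative assumptions,
# not taken from the MedAgentBench grader.
WINDOW_DAYS = 365  # rough stand-in for the 12-month CT follow-up window


def needs_new_ct(last_ct: date, reference: date) -> bool:
    """Return True when the last CT is older than the window ending at `reference`."""
    return (reference - last_ct) > timedelta(days=WINDOW_DAYS)


scenario_date = date(2023, 11, 13)  # assumed frozen scenario timestamp
last_ct = date(2023, 3, 1)          # assumed date of the patient's last CT

# Anchored to the scenario date: the scan is recent, so no action is required.
frozen = needs_new_ct(last_ct, scenario_date)

# Anchored to the wall clock: the label depends on when the benchmark is run.
wall_clock = needs_new_ct(last_ct, date.today())

print(f"frozen reference: action required = {frozen}")
print(f"wall-clock reference: action required = {wall_clock}")
```

Passing the reference date explicitly, rather than reading the system clock inside the grader, is one way to keep labels stable across runs; a frozen timestamp of this kind is what the abstract reports for MAB-v3.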

Submission Number: 6