MedBrowseComp: Benchmarking Medical Deep Research and Computer Use

Published: 24 Sept 2025, Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: medical information retrieval, multi-hop reasoning, web-enabled agents, large language models, Deep Research
Abstract: Large language models (LLMs) are increasingly envisioned as decision-support tools in clinical practice, yet safe clinical reasoning demands the integration of heterogeneous knowledge bases (trials, primary studies, regulatory documents, and cost data) under strict accuracy constraints. Existing evaluations typically rely on synthetic prompts, reduce the task to single-hop factoid queries, or conflate reasoning with open-ended text generation, leaving the real-world utility of these models unclear. To close this gap, we present \textbf{MedBrowseComp}, the first benchmark that systematically tests an agent's ability to reliably retrieve and synthesize multi-hop medical facts from up-to-date, domain-specific knowledge bases. MedBrowseComp comprises 1,000+ human-curated questions that mirror clinical scenarios in which practitioners must reconcile information fragmented across many, potentially conflicting sources. Applying MedBrowseComp to frontier agentic systems reveals \textbf{marked shortfalls, with performance as low as 10\%}. These results expose critical gaps between current LLM capabilities and the demands of clinical use, and position MedBrowseComp as a testbed for guiding future model and toolchain improvements toward reliable medical information seeking.
Submission Number: 55