Keywords: Benchmarks, Multi-document Reasoning, Medical AI
TL;DR: We introduce MedEvidence, a benchmark to test if LLMs can replicate expert systematic reviews.
Abstract: Systematic reviews (SRs), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone of evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate SR generation. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide recommendations at the same proficiency as domain experts remains poorly characterized. We therefore ask: Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies?
To explore this question, we present MedEvidence, a benchmark pairing findings from 100 SRs with the studies they are based on. We benchmark 24 LLMs on MedEvidence, including reasoning, non-reasoning, and medical-specialist models across a range of sizes (7B to 700B parameters). Through our systematic evaluation, we find that reasoning does not necessarily improve performance, larger models do not consistently yield greater gains, and knowledge-based fine-tuning degrades accuracy on MedEvidence. Instead, most models exhibit similar behavior: performance
tends to degrade as token length increases, their responses show overconfidence, and, unlike human experts, all models lack scientific skepticism toward low-quality findings. These results suggest that more work is required before LLMs can reliably match the observations from expert-conducted SRs, even though such systems are already deployed and being used by clinicians. We release our codebase and benchmark to the broader research community to further investigate LLM-based SR systems.
Supplementary Material: pdf
Primary Area: datasets and benchmarks
Submission Number: 16621