Keywords: Large language models (LLMs), Model benchmarking, Evidence-based medicine, Systematic reviews
TL;DR: We introduce the MedEvidence dataset to evaluate how well LLMs can replicate expert conclusions in systematic medical reviews.
Abstract: Systematic reviews (SRs), in which experts summarize and analyze evidence across individual studies to provide insights on a specialized topic, are a cornerstone for evidence-based clinical decision-making, research, and policy. Given the exponential growth of scientific articles, there is growing interest in using large language models (LLMs) to automate this process. However, the ability of LLMs to critically assess evidence and reason across multiple documents to provide expert-quality observations remains poorly characterized. We therefore ask: Can LLMs match the conclusions of systematic reviews written by clinical experts when given access to the same studies? To explore this question, we present MedEvidence, a benchmark pairing findings from SRs with the studies they are based on. We benchmark 24 LLMs on MedEvidence, including reasoning models, medical specialist models, and models of varying sizes. We find that reasoning does not necessarily improve performance, larger models do not consistently yield greater gains, and knowledge-based fine-tuning tends to degrade accuracy on MedEvidence. Instead, most models exhibit similar behavior: performance tends to degrade as token length increases, their responses show overconfidence, and all models show a lack of scientific skepticism toward low-quality findings. These results suggest that more work is still required before LLMs can reliably match the observations from expert-conducted SRs, even though these systems are already deployed and being used by clinicians.
Croissant File: json
Dataset URL: https://huggingface.co/datasets/lekjfk21/med-evidence
Code URL: https://github.com/zyfetcabc/medevidence
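For readers who want to inspect the benchmark directly, the sketch below shows one way the dataset linked above might be loaded with the Hugging Face `datasets` library; the split and field names are assumptions, not taken from the paper or repository.

```python
# Minimal sketch: loading MedEvidence from the Hugging Face Hub.
# Split names and example fields are assumptions; inspect the printed
# dataset object to see the actual structure.
from datasets import load_dataset

dataset = load_dataset("lekjfk21/med-evidence")
print(dataset)  # shows available splits and column names

# Peek at a few examples from the first available split.
first_split = list(dataset.keys())[0]
for example in dataset[first_split].select(range(3)):
    print(example)
```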
Supplementary Material: pdf
Primary Area: AI/ML Datasets & Benchmarks for health sciences (e.g. climate, health, life sciences, physics, social sciences)
Submission Number: 2149