SciLitBench: Benchmark and Design Principles for LLM-Powered Systematic Literature Reviews

ICLR 2026 Conference Submission 22305 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: LLM, benchmark, dataset, literature review, screening, data extraction, abstract screening, design principle, systematic review
Abstract: Systematic literature reviews are essential for science but remain labor-intensive. To benchmark and improve automation, we introduce SciLitBench, a new dataset of 42,980 curated abstracts, 2,311 full texts, and TODOXXX structured data elements (e.g., study-level metadata, PICO entities, outcome measures, and evidence) annotated and labeled for inclusion decisions and knowledge reasoning. Across 22 open-source large language models (LLMs), we uncover general design principles that make automation reliable under recall-skewed objectives ($F_2$). First, we observe clear scaling and prompt-design effects: explicit inclusion/exclusion prompting improves accuracy by up to +29\%, while adding researcher ``thought traces'' yields a +28\% gain. Second, we show that reliability under full-text screening depends sharply on the interaction between context length and model capacity. Motivated by this, we introduce a token-length–aware routing system that surpasses ensembles of the strongest models ($F_2 = 0.949$ vs.\ $0.938$). Finally, we demonstrate a human-in-the-loop, rubric-guided extraction workflow that separates field extraction from guideline-adherence checking, aligning model outputs with domain standards in response to researcher feedback. Together, our benchmark and findings establish scaling, prompt design, thought traces, and adaptive routing as key principles for reliable, researcher-aligned automation of systematic reviews.
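For context, the $F_2$ score used above is the standard recall-weighted $F_\beta$ metric with $\beta = 2$, i.e. $F_2 = \frac{5 \cdot P \cdot R}{4P + R}$, which penalizes missed relevant studies more heavily than false inclusions. The abstract does not spell out the routing mechanism; the following Python sketch is a hypothetical illustration (not the authors' implementation) of what a token-length–aware router could look like, with the tier names and reliability thresholds invented for the example.

```python
# Hypothetical sketch: route each document to the cheapest model tier whose
# reliable context window covers the document's token length. Tier names and
# thresholds below are illustrative assumptions, not values from the paper.

from dataclasses import dataclass


@dataclass
class ModelTier:
    name: str
    max_reliable_tokens: int  # assumed per-tier reliability threshold


# Assumed tiers, ordered from cheapest/smallest to most capable.
TIERS = [
    ModelTier("small-8k", 8_000),
    ModelTier("medium-32k", 32_000),
    ModelTier("large-128k", 128_000),
]


def route(num_tokens: int) -> ModelTier:
    """Pick the first (cheapest) tier whose reliable window fits the input."""
    for tier in TIERS:
        if num_tokens <= tier.max_reliable_tokens:
            return tier
    return TIERS[-1]  # overlong inputs fall back to the largest-context model


if __name__ == "__main__":
    print(route(5_000).name)   # -> small-8k
    print(route(60_000).name)  # -> large-128k
```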
Primary Area: datasets and benchmarks
Submission Number: 22305