A Pipeline to Bootstrap the Evaluation of Retrieval-Augmented Generation for the Automation of Systematic Reviews in Computer Science
Keywords: Systematic Reviews, RAG, Deep Research Agents, Benchmark
Abstract: Automating the creation of systematic reviews, i.e., evidence-driven analyses of a specific area of research, is a natural target for retrieval-augmented generation and deep research agents, yet existing benchmarks evaluate only isolated subtasks or assume fixed evidence inputs.
We introduce RAG4SR-CS-200, a benchmark of 200 computer science systematic reviews designed for protocol-driven systematic review automation. Each instance comprises review objectives, research questions, eligibility criteria, cleaned full-text review structure, references, and extracted tables.
These elements support evaluation across the key tasks of systematic review creation, such as literature retrieval, eligibility screening, citation-grounded review generation, and structured table generation, in both stage-wise and end-to-end settings.
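The per-instance fields listed above can be sketched as a simple data record; the class and field names below are hypothetical illustrations based only on the abstract, not the released data format.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of one benchmark instance; actual field names
# and types in the released dataset may differ.
@dataclass
class ReviewInstance:
    objectives: str                       # review objectives
    research_questions: list              # guiding research questions
    eligibility_criteria: list            # inclusion/exclusion criteria
    review_structure: list                # cleaned full-text section outline
    references: list                      # cited works (placeholder keys here)
    extracted_tables: list = field(default_factory=list)

# Illustrative example with placeholder content.
example = ReviewInstance(
    objectives="Survey retrieval-augmented generation for code search.",
    research_questions=["RQ1: Which retrieval models are used?"],
    eligibility_criteria=["Peer-reviewed CS venues only"],
    review_structure=["1 Introduction", "2 Methods", "3 Results"],
    references=["paper-001", "paper-002"],
)
print(len(example.references))
```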
In short, RAG4SR-CS-200 provides the foundation for developing more reliable and diagnosable deep research agents for scientific evidence synthesis.
Code and data are publicly available (https://github.com/webis-de/rag4sr-cs-200).
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 11