A Pipeline to Bootstrap the Evaluation of Retrieval-Augmented Generation for the Automation of Systematic Reviews in Computer Science

Published: 01 May 2026, Last Modified: 01 May 2026 · RAG4Report 2026 Poster · CC BY 4.0
Keywords: Systematic Reviews, RAG, Deep Research Agents, Benchmark
Abstract: Automating the creation of systematic reviews, i.e., evidence-driven analyses of a specific area of research, is a natural target for retrieval-augmented generation and deep research agents, yet existing benchmarks evaluate only isolated subtasks or assume fixed evidence inputs. We introduce RAG4SR-CS-200, a benchmark of 200 computer science systematic reviews designed for protocol-driven systematic review automation. Each instance comprises review objectives, research questions, eligibility criteria, cleaned full-text review structure, references, and extracted tables. These elements support evaluation across key tasks in systematic review creation such as literature retrieval, eligibility screening, citation-grounded review generation, and structured table generation, in both stage-wise and end-to-end settings. In short, RAG4SR-CS-200 provides the foundation for developing more reliable and diagnosable deep research agents for scientific evidence synthesis. Code and data are publicly available (https://github.com/webis-de/rag4sr-cs-200).
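The abstract describes each benchmark instance as a bundle of review objectives, research questions, eligibility criteria, cleaned full-text structure, references, and extracted tables. A minimal sketch of how such an instance might be represented is below; the class name, field names, and types are illustrative assumptions, not the benchmark's actual release format (see the linked repository for that).

```python
from dataclasses import dataclass, field

@dataclass
class ReviewInstance:
    """Hypothetical schema for one RAG4SR-CS-200 instance (illustrative only)."""
    review_id: str
    objectives: str                       # stated goals of the systematic review
    research_questions: list[str]         # RQs the review sets out to answer
    eligibility_criteria: dict[str, list[str]]  # e.g. {"inclusion": [...], "exclusion": [...]}
    sections: list[str]                   # cleaned full-text review structure
    references: list[str]                 # cited papers (identifiers or metadata)
    tables: list[dict] = field(default_factory=list)  # extracted structured tables

# Example instance, with placeholder content
inst = ReviewInstance(
    review_id="sr-001",
    objectives="Survey deep research agents for evidence synthesis.",
    research_questions=["RQ1: Which retrieval strategies are used?"],
    eligibility_criteria={"inclusion": ["peer-reviewed"], "exclusion": ["preprints"]},
    sections=["Introduction", "Methods", "Results"],
    references=["doi:10.0000/example"],
)

# Such a structure supports the stage-wise tasks named in the abstract:
# retrieval (references), screening (eligibility_criteria),
# citation-grounded generation (sections + references), and table generation (tables).
```

Keeping the eligibility criteria as a separate field (rather than folding them into the objectives) mirrors the abstract's point that screening can be evaluated as an isolated stage or as part of an end-to-end run.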
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 11