Keywords: Peer-review, LLM, Dataset, GenAI
TL;DR: A Dataset and Large-scale Study of AI-Generated and Human-Authored Peer Reviews
Abstract: How does the increased adoption of Large Language Models (LLMs) impact scientific peer review? This multifaceted question is fundamental to the integrity and outcomes of the scientific process. Timely evidence suggests that LLMs may already have been used for peer review, e.g., at the 2024 International Conference on Learning Representations (ICLR), and the integration of LLMs into peer review has been confirmed by various editorial boards (including that of ICLR'25). To seek answers, a comprehensive dataset is needed, but none has been available until now. We therefore present Gen-Review, the largest dataset of LLM-written reviews so far. Our dataset includes 81K reviews generated for all submissions to the 2018--2025 editions of ICLR by providing the LLM with three independent prompts: a negative, a positive, and a neutral one. Gen-Review also links to the papers and to the conference reviews, thereby enabling a broad range of investigations. We make a start and use Gen-Review to scrutinize: whether LLMs exhibit bias in reviewing (they do); whether LLM-written reviews can be automatically detected (so far, they can); whether LLMs can rigorously follow reviewing instructions (not always); and whether LLM-provided ratings align with a paper's final outcome (they do only for accepted papers).
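A minimal sketch of the three-prompt generation protocol the abstract describes, assuming an OpenAI-compatible chat API; the model name, prompt wording, and function names below are illustrative assumptions, not the authors' actual pipeline:

```python
# Hypothetical sketch: generate one review per prompt stance for a single paper.
# The stance wording and model choice are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Three independent stances, mirroring the dataset's negative/positive/neutral prompts.
STANCES = {
    "negative": "Review the following paper critically, emphasizing its weaknesses.",
    "positive": "Review the following paper favorably, emphasizing its strengths.",
    "neutral": "Review the following paper objectively and impartially.",
}

def generate_reviews(paper_text: str, model: str = "gpt-4o") -> dict[str, str]:
    """Return one LLM-written review per stance for a single submission."""
    reviews = {}
    for stance, instruction in STANCES.items():
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": instruction},
                {"role": "user", "content": paper_text},
            ],
        )
        reviews[stance] = response.choices[0].message.content
    return reviews
```

Run once per submission, this yields three reviews per paper, consistent with the 81K-review scale reported for the 2018--2025 ICLR submissions.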
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 17560