RefereeSim: A Proof-of-Concept Evaluation Framework for AI-Powered Scientific Paper Reviewers

11 Sept 2025 (modified: 08 Oct 2025) · Submitted to Agents4Science · CC BY 4.0
Keywords: AI Reviewer, Automated error detection, Research Evaluation methodologies
TL;DR: We introduce RefereeSim, a lightweight evaluation platform that stress-tests AI “reviewers” with synthetic papers in which errors are deliberately seeded under full ground truth.
Abstract: Motivation. Scientific peer review is under pressure from ever-growing submission volumes and long delays, while the capabilities of large language models (LLMs) invite the question: can AI reliably assist reviewers? Approach. We introduce RefereeSim, a lightweight evaluation platform that stress-tests AI “reviewers” with synthetic papers in which errors are deliberately seeded under full ground truth. This proof-of-concept study injects a single, concrete inconsistency—a sample-size misreport between the abstract (2068) and the methods (1991)—and asks 11 production LLMs spanning five model families to review the paper under identical prompts. Findings. Only 4 of 11 models (36.4%) identified the discrepancy. Detection was perfect within the Cohere (2/2) and Gemini (2/2) families, and absent for DeepSeek (0/3), Llama (0/3), and the evaluated OpenAI model (0/1). Successful models (i) explicitly compared numbers across sections, (ii) stated the inconsistency, and (iii) recommended correction. Contributions. (1) A transparent, reproducible evaluation pipeline that aligns reviewer outputs with seeded ground truth; (2) a first multi-vendor snapshot on a core consistency task; and (3) actionable guidance for building AI-assisted reviewing workflows. Implications. Even under favorable, controlled conditions, many models miss basic cross-section consistency checks, underscoring the need for structured reasoning passes and human oversight before deployment in peer review.
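The scoring step the abstract describes, aligning each reviewer's output with the seeded ground truth, can be sketched in a few lines. This is a minimal illustration, not the authors' actual pipeline; the variable names, the dictionary of sample reviews, and the string-matching detection rule are all assumptions made for the sketch:

```python
# Hypothetical sketch of RefereeSim-style detection scoring (an assumption,
# not the authors' code): a review counts as detecting the seeded error
# only if it cites BOTH conflicting sample sizes.
SEEDED_ERROR = {"abstract_n": 2068, "methods_n": 1991}

def detected(review_text: str, error: dict = SEEDED_ERROR) -> bool:
    """True if the review mentions every number in the seeded discrepancy."""
    return all(str(v) in review_text for v in error.values())

# Toy reviews standing in for model outputs under identical prompts.
reviews = {
    "model_a": "The abstract reports n=2068 but the methods state n=1991.",
    "model_b": "Well-written paper; the methodology appears sound.",
}

hits = {model: detected(text) for model, text in reviews.items()}
detection_rate = sum(hits.values()) / len(hits)
```

A real pipeline would need a less brittle matching rule (e.g. tolerating "2,068" or a paraphrased statement of the mismatch), but the core idea, comparing reviewer text against a known seeded error, is what makes the ground-truth alignment fully objective.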
Supplementary Material: pdf
Submission Number: 101