Keywords: AI Reviewer, Automated Error Detection, Research Evaluation Methodologies
TL;DR: We introduce RefereeSim, a lightweight evaluation platform that stress-tests AI “reviewers” with synthetic papers in which errors are deliberately seeded under full ground truth.
Abstract: Motivation. Scientific peer review is under pressure from ever-growing submission volumes and long delays, while the capabilities of large language models (LLMs) invite the question: can AI reliably assist reviewers? Approach. We introduce RefereeSim, a lightweight evaluation platform that stress-tests AI “reviewers” with synthetic papers in which errors are deliberately seeded under full ground truth. This proof-of-concept study injects a single, concrete inconsistency, a sample-size misreport between the abstract (2068) and the methods (1991), and asks 11 production LLMs spanning five model families to review the paper under identical prompts. Findings. Only 4 of 11 models (36.4%) identified the discrepancy. Detection was perfect within the Cohere (2/2) and Gemini (2/2) families, and absent for DeepSeek (0/3), Llama (0/3), and the evaluated OpenAI model (0/1). Successful models (i) explicitly compared numbers across sections, (ii) stated the inconsistency, and (iii) recommended correction. Contributions. (1) A transparent, reproducible evaluation pipeline that aligns reviewer outputs with seeded ground truth; (2) a first multi-vendor snapshot on a core consistency task; and (3) actionable guidance for building AI-assisted reviewing workflows. Implications. Even under favorable, controlled conditions, many models miss basic cross-section consistency checks, underscoring the need for structured reasoning passes and human oversight before deployment in peer review.
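The abstract describes aligning reviewer outputs with seeded ground truth but does not give the scoring rule. As a minimal sketch only (the function name, cue list, and keyword-matching criterion below are assumptions, not the paper's actual pipeline), detection of the seeded sample-size mismatch could be scored like this:

```python
# Hypothetical sketch of the alignment step: the seeded ground truth is a
# single sample-size mismatch between the abstract (2068) and the methods
# (1991). A review counts as "detecting" it if the text cites both numbers
# and signals an inconsistency. This is an illustrative rule, not the
# paper's documented criteria.

SEEDED_VALUES = ("2068", "1991")  # ground-truth numbers from the synthetic paper
INCONSISTENCY_CUES = ("inconsisten", "discrepan", "mismatch", "does not match")

def detects_seeded_error(review_text: str) -> bool:
    """Return True if the review cites both seeded numbers and flags a conflict."""
    text = review_text.lower()
    cites_both = all(value in text for value in SEEDED_VALUES)
    flags_conflict = any(cue in text for cue in INCONSISTENCY_CUES)
    return cites_both and flags_conflict

# Example: aggregate a detection rate over reviews from several models
# (model names and review texts are placeholders).
reviews = {
    "model_a": "The abstract reports n=2068 but the methods state n=1991; "
               "this inconsistency should be corrected.",
    "model_b": "The methodology is sound and the sample size is adequate.",
}
detected = sum(detects_seeded_error(r) for r in reviews.values())
print(f"detected {detected}/{len(reviews)} ({detected / len(reviews):.1%})")
```

Under this sketch, a review must both name the two seeded values and explicitly flag the conflict, matching the three behaviors the abstract attributes to successful models.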
Supplementary Material: pdf
Submission Number: 101