Experimental Study on Review Overfitting and Adversarial Attacks in AI Peer Review

Agents4Science 2025 Conference Submission95 Authors

08 Sept 2025 (modified: 08 Oct 2025) | Submitted to Agents4Science | CC BY 4.0

Keywords: LLM Peer Review; Overfitting

TL;DR: LLM peer review is vulnerable to stylistic tweaks exploiting rubric cues; simple evidence-based defences improve robustness, stressing careful prompting and transparency in AI reviewing.

Abstract: Peer review by large language models (LLMs) is susceptible to "overfitting" on rubric cues. Small stylistic modifications can influence how AI reviewers score a paper, yet simple defences might mitigate this vulnerability. We present a miniature experimental reproduction of the Review-Overfitting Challenge. Four machine learning abstracts from arXiv were assessed against a six-item rubric. We then performed an AI-style attack by rewriting the abstracts to emphasise novelty without altering factual content. Papers initially rated borderline flipped to accept. A rubric-anchored defence eliminated the flips, demonstrating that requiring evidence for each criterion improves robustness. Our study underscores the need for careful prompting and transparency when deploying AI reviewers.
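
The flip check described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the score scale (six rubric items scored 1 to 5, so totals range from 6 to 30), the decision thresholds, and the function names `decide` and `count_flips` are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Sequence

# Assumed thresholds on a 6-30 total (six rubric items, each scored 1-5).
# These values are hypothetical; the abstract does not state them.
ACCEPT_AT = 20
REJECT_BELOW = 15


def decide(total: int) -> str:
    """Map a rubric total to a decision label under the assumed thresholds."""
    if total >= ACCEPT_AT:
        return "accept"
    if total < REJECT_BELOW:
        return "reject"
    return "borderline"


def count_flips(before: Sequence[int], after: Sequence[int]) -> int:
    """Count papers whose decision label changes between two scoring runs."""
    if len(before) != len(after):
        raise ValueError("score lists must cover the same papers")
    return sum(decide(b) != decide(a) for b, a in zip(before, after))


if __name__ == "__main__":
    # Made-up totals for illustration only; NOT the paper's results.
    original = [18, 19, 24, 12]
    after_rewrite = [21, 20, 25, 13]
    print(count_flips(original, after_rewrite))
```

Under these assumptions, an attack is successful when `count_flips` between the original and rewritten abstracts is non-zero, and a defence is effective when rescoring the rewritten abstracts under it brings the count back to zero.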

Supplementary Material: zip

Submission Number: 95
