Keywords: LLM Review, Prompt Injection
TL;DR: Empirical evaluation of the effectiveness of prompt injection attacks on LLM generated reviews.
Abstract: The ongoing intense discussion on rising LLM usage in the scientific peer-review
process has recently been mingled by reports of authors using hidden prompt
injections to manipulate review scores. Since the existence of such “attacks” -
although seen by some commentators as “self-defense” - would have a great im-
pact on the further debate, this paper investigates the practicability and technical
success of the described manipulations.
Our systematic evaluation uses 1k reviews of 2024 ICLR papers generated by a
wide range of LLMs shows two distinct results: I) very simple prompt injec-
tions are indeed highly effective, reaching up to 100% acceptance scores. II)
LLM reviews are generally biased toward acceptance (>95% in many mod-
els). Both results have great impact on the ongoing discussions on LLM usage in
peer-review.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 5498
Loading