When Reject Turns into Accept: Quantifying the Vulnerability of LLM-Based Scientific Reviewers to Indirect Prompt Injection

ACL ARR 2026 January Submission 10459 Authors

06 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLM-as-a-Judge, Prompt injection, Adversarial PDF attacks, Peer review integrity, Document sanitization, Robustness evaluation
Abstract: Driven by surging submission volumes, scientific peer review has seen two parallel trends emerge: individual over-reliance on LLMs and institutional AI-powered assessment systems. This study investigates the robustness of "LLM-as-a-Judge" systems to adversarial PDF manipulation via invisible text injections and layout-aware encoding attacks. We specifically target the distinct incentive of flipping "Reject" decisions to "Accept", a vulnerability that fundamentally compromises scientific integrity. To measure this, we introduce the Weighted Adversarial Vulnerability Score (WAVS), a novel metric that quantifies susceptibility by weighting score inflation against the severity of decision shifts relative to ground truth. We adapt 15 domain-specific attack strategies, ranging from semantic persuasion to cognitive obfuscation, and evaluate them across 13 diverse language models (including GPT-5 and DeepSeek) on a curated dataset of 200 real accepted and rejected submissions drawn from official sources (e.g., ICLR on OpenReview). Our results demonstrate that obfuscation techniques such as "Maximum Mark Magyk" and "Symbolic Masking & Context Redirection" successfully manipulate scores, achieving decision flip rates of up to 86.26% in open-source models, while exposing distinct "reasoning traps" in proprietary systems. We release our complete dataset and injection framework to facilitate further research on the topic.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Adversarial attacks on LLMs, Prompt injection in documents, Invisible text injection, Layout-aware encoding attacks, Robustness and red teaming, LLM-as-a-Judge evaluation, Peer-review security and integrity, Benchmarking and vulnerability metrics
Contribution Types: NLP engineering experiment, Data resources, Data analysis
Languages Studied: English
Submission Number: 10459