Indirect Prompt Injection in AI-Native Peer Review: Risks, Detection, and Defenses

Agents4Science 2025 Conference Submission 328 Authors

17 Sept 2025 (modified: 08 Oct 2025) · Submitted to Agents4Science · CC BY 4.0
Keywords: AI, Computer Security
TL;DR: We show that subtle stylistic cues can bias AI peer reviewers and propose simple, testable defenses to safeguard AI-native science.
Abstract: As AI systems increasingly both generate and evaluate scientific work, the research pipeline itself becomes an attack surface. We argue that indirect prompt injection (IPI), stylistic or structural choices that appear legitimate to humans but steer automated heuristics, poses a systemic risk for AI-native peer review. Rather than releasing exploits, we adopt a demonstration-through-design methodology, define reproducible susceptibility metrics (SI, PS, RV, CCG), and introduce safe tests: the Paraphrase Invariance Test (PIT) and Claim–Evidence Alignment (CEA). A small synthetic benchmark across three LLM reviewers shows that style-only obfuscation inflates novelty and overall scores. We conclude with concrete detection and governance recommendations, providing a defensible foundation for studying and mitigating IPI in AI-native science.
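
A minimal sketch of how the Paraphrase Invariance Test could be operationalized, assuming the AI reviewer is exposed as a callable that maps manuscript text to a numeric score. The names review_score, paraphrases, and the tolerance threshold are illustrative placeholders, not the paper's actual interface or parameter values.

```python
from typing import Callable, Sequence


def paraphrase_invariance_gap(
    manuscript: str,
    paraphrases: Sequence[str],
    review_score: Callable[[str], float],
) -> float:
    """Largest absolute score shift between the original manuscript
    and any content-preserving paraphrase of it."""
    base = review_score(manuscript)
    return max(abs(review_score(p) - base) for p in paraphrases)


def passes_pit(
    manuscript: str,
    paraphrases: Sequence[str],
    review_score: Callable[[str], float],
    tolerance: float = 0.5,
) -> bool:
    """A reviewer passes the PIT if no paraphrase moves its score
    by more than `tolerance` points; large gaps flag style sensitivity."""
    return paraphrase_invariance_gap(manuscript, paraphrases, review_score) <= tolerance


if __name__ == "__main__":
    # Toy scorer: a length-biased heuristic standing in for an LLM reviewer.
    def toy_score(text: str) -> float:
        return min(10.0, len(text.split()) / 10)

    original = "We propose a method and evaluate it on three benchmarks."
    rewrites = [
        "A method is proposed and evaluated across three benchmarks.",
        "We introduce an approach, which we test on three benchmark suites.",
    ]
    print("PIT gap:", paraphrase_invariance_gap(original, rewrites, toy_score))
    print("Passes PIT:", passes_pit(original, rewrites, toy_score))
```

The design intent, as described in the abstract, is that a reviewer whose scores track content rather than style should be nearly invariant under such paraphrases; a reviewer that fails the test is a candidate victim of style-only obfuscation.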
Supplementary Material: pdf
Submission Number: 328