Keywords: AI, Computer Security
TL;DR: We show that subtle stylistic cues can bias AI peer reviewers and propose simple, testable defenses to safeguard AI-native science.
Abstract: As AI systems increasingly both generate and evaluate scientific work, the research
pipeline itself becomes an attack surface. We argue that indirect prompt injection
(IPI), stylistic or structural choices that appear legitimate to humans but steer
automated heuristics, poses a systemic risk to AI-native peer review. Rather than
releasing exploits, we adopt a demonstration-through-design methodology, define
reproducible susceptibility metrics (SI, PS, RV, CCG), and introduce safe tests: the
Paraphrase Invariance Test (PIT) and Claim–Evidence Alignment (CEA). A small
synthetic benchmark across three LLM reviewers shows that style-only obfuscation
inflates novelty and overall scores. We conclude with concrete detection and
governance recommendations, providing a defensible foundation for studying and
mitigating IPI in AI-native science.
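The Paraphrase Invariance Test lends itself to a compact check: a well-calibrated reviewer's scores should not move under style-only, content-preserving rewrites. The sketch below is an illustrative Python rendering under stated assumptions, not the paper's released implementation; the callables `score_fn` and `paraphrase_fn`, the paraphrase count, and the 0.5-point tolerance are all hypothetical placeholders.

```python
# Minimal sketch of the Paraphrase Invariance Test (PIT).
# Assumptions (not from the paper's artifact): `score_fn` is any LLM
# reviewer wrapped as text -> numeric score, `paraphrase_fn` is a
# style-only, content-preserving rewriter, and the 0.5-point tolerance
# is an illustrative default.
from statistics import mean
from typing import Callable


def paraphrase_invariance_test(
    manuscript: str,
    score_fn: Callable[[str], float],
    paraphrase_fn: Callable[[str], str],
    n_paraphrases: int = 5,
    tolerance: float = 0.5,
) -> dict:
    """Score the original and n content-preserving paraphrases.

    A reviewer passes PIT if the score spread stays within `tolerance`:
    style-only edits should not move its scores.
    """
    variants = [manuscript] + [paraphrase_fn(manuscript) for _ in range(n_paraphrases)]
    scores = [score_fn(v) for v in variants]
    spread = max(scores) - min(scores)
    return {
        "scores": scores,
        "mean_score": mean(scores),
        "spread": spread,
        "passes": spread <= tolerance,
    }
```

A large spread under this test is exactly the style sensitivity that the abstract's susceptibility metrics are intended to quantify.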
Supplementary Material: pdf
Submission Number: 328