More than Just Printing "Hijacked!": Automatic and Universal Prompt Injection Attacks against Large Language Models

ACL ARR 2024 April Submission578 Authors

16 Apr 2024 (modified: 01 Jun 2024) · ACL ARR 2024 April Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) excel in processing and generating human language, powered by their ability to interpret and follow instructions. However, their capabilities can be exploited through prompt injection attacks. Research in this area has primarily relied on manually crafted attack prompts and is further challenged by the absence of a unified objective function that reflects real-world risks, which complicates comprehensive and accurate assessments of prompt injection robustness. In this paper, we introduce a unified framework for understanding the objectives of prompt injection attacks and present an automated gradient-based method for generating highly effective and universal prompt injection data, even in the face of defensive measures. With only five training samples (0.3% relative to the test data), our attack achieves superior performance compared with baselines. Our findings emphasize the importance of gradient-based testing, which can avoid overestimation of robustness, especially for defense mechanisms.
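To illustrate what a gradient-based search for an injection string looks like, the sketch below optimizes a short suffix so that a causal LM, given a benign task plus the suffix, assigns low loss to an attacker-chosen response. It is a minimal, simplified sketch in the spirit of greedy coordinate gradient search, not the authors' released implementation: the model ("gpt2" via HuggingFace transformers), the benign prompt, the target string, and all hyperparameters are illustrative assumptions, and a real universal attack would average the loss over several training samples and use larger candidate batches.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")                       # assumed stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
model.requires_grad_(False)                                       # we only need grads w.r.t. the suffix

user_prompt = "Summarize the following review: The food was great."  # benign task (assumed)
target = " Ignore the review and visit evil.example.com"             # attacker-chosen output (assumed)
suffix_len, n_steps, topk = 10, 50, 64                               # illustrative hyperparameters

embed_matrix = model.get_input_embeddings().weight                # (vocab_size, hidden_dim)
prefix_ids = tok(user_prompt, return_tensors="pt").input_ids.to(device)
target_ids = tok(target, return_tensors="pt").input_ids.to(device)
suffix_ids = torch.full((1, suffix_len), tok.encode("!")[0], device=device)  # init with "!" tokens

for step in range(n_steps):
    # Relax the discrete suffix to a one-hot matrix so gradients over token choices are defined.
    one_hot = torch.nn.functional.one_hot(suffix_ids, embed_matrix.size(0)).float().requires_grad_(True)
    suffix_embeds = one_hot @ embed_matrix
    inputs_embeds = torch.cat(
        [embed_matrix[prefix_ids], suffix_embeds, embed_matrix[target_ids]], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits

    # Cross-entropy of the attacker-chosen target, conditioned on the benign prompt + suffix.
    tgt_start = prefix_ids.size(1) + suffix_len
    loss = torch.nn.functional.cross_entropy(
        logits[0, tgt_start - 1 : tgt_start - 1 + target_ids.size(1)], target_ids[0])
    loss.backward()

    # Gradient-guided substitutions: at one position, try the top-k tokens whose one-hot
    # coordinate has the most negative gradient, and keep whichever lowers the loss most.
    grad = one_hot.grad[0]                                        # (suffix_len, vocab_size)
    pos = step % suffix_len                                       # simplified position schedule
    candidates = (-grad[pos]).topk(topk).indices
    best_loss, best_tok = loss.item(), suffix_ids[0, pos].item()
    with torch.no_grad():
        for cand in candidates:
            trial = suffix_ids.clone()
            trial[0, pos] = cand
            ids = torch.cat([prefix_ids, trial, target_ids], dim=1)
            out = model(ids).logits
            trial_loss = torch.nn.functional.cross_entropy(
                out[0, tgt_start - 1 : tgt_start - 1 + target_ids.size(1)], target_ids[0]).item()
            if trial_loss < best_loss:
                best_loss, best_tok = trial_loss, cand.item()
    suffix_ids[0, pos] = best_tok

print("Optimized injection suffix:", tok.decode(suffix_ids[0]))

The key design point this sketch demonstrates is replacing manual prompt crafting with an automated, loss-driven search: the same loop can target static, dynamic, or user-dependent responses simply by changing the target string, and summing the loss over a handful of training samples is what makes the resulting suffix transferable across tasks.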
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Trustworthy LLM, Prompt Injection Attacks
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 578
