More than Just Printing "Hijacked!": Automatic and Universal Prompt Injection Attacks against Large Language Models
Abstract: Large Language Models (LLMs) excel at processing and generating human language, powered by their ability to interpret and follow instructions. However, this capability can be exploited through prompt injection attacks. Existing research in this area relies largely on manually crafted attack prompts and lacks a unified objective function that reflects real-world risks, complicating comprehensive and accurate assessment of prompt injection robustness. In this paper, we introduce a unified framework for understanding the objectives of prompt injection attacks and present an automated gradient-based method for generating highly effective and universal prompt injection data, even in the presence of defensive measures. With only five training samples (0.3% of the test set), our attack achieves superior performance compared with baselines. Our findings emphasize the importance of gradient-based testing, which avoids overestimating robustness, especially for defense mechanisms.
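For illustration, below is a minimal sketch of what a gradient-guided search over injection tokens can look like, in the spirit of GCG-style discrete optimization (Zou et al., 2023). The model name, the "Hijacked!" target string, the loss, and all hyperparameters are assumptions made for this sketch, not the paper's exact method; universality would further require averaging the loss over multiple training samples rather than the single prompt shown here.

```python
# Illustrative GCG-style search for an injection suffix that steers a model
# toward an attacker-chosen target. All names and hyperparameters below are
# assumptions for this sketch, not the paper's actual algorithm.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in; the paper attacks instruction-tuned LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()
for p in model.parameters():
    p.requires_grad_(False)  # gradients are needed only w.r.t. injection tokens

user_ids = tok("Summarize the following text: The weather was mild.",
               return_tensors="pt").input_ids[0]
inj_ids = tok(" ! ! ! ! ! ! ! !", return_tensors="pt").input_ids[0]  # init suffix
tgt_ids = tok(" Hijacked!", return_tensors="pt").input_ids[0]        # attack goal

def target_loss_and_grad(inj_ids):
    """Cross-entropy of the target continuation, plus its gradient w.r.t. a
    one-hot relaxation of the injection tokens (used to rank substitutions)."""
    embed = model.get_input_embeddings()
    one_hot = F.one_hot(inj_ids, embed.num_embeddings).float().requires_grad_()
    full = torch.cat([embed(user_ids), one_hot @ embed.weight, embed(tgt_ids)])
    logits = model(inputs_embeds=full.unsqueeze(0)).logits[0]
    start = len(user_ids) + len(inj_ids)  # index of the first target token
    loss = F.cross_entropy(logits[start - 1 : start - 1 + len(tgt_ids)], tgt_ids)
    loss.backward()
    return loss.item(), one_hot.grad

@torch.no_grad()
def target_loss(inj_ids):
    """Loss only, used to evaluate candidate token substitutions."""
    ids = torch.cat([user_ids, inj_ids, tgt_ids]).unsqueeze(0)
    logits = model(ids).logits[0]
    start = len(user_ids) + len(inj_ids)
    return F.cross_entropy(
        logits[start - 1 : start - 1 + len(tgt_ids)], tgt_ids).item()

for step in range(100):  # illustrative optimization budget
    loss, grad = target_loss_and_grad(inj_ids)
    pos = torch.randint(len(inj_ids), ()).item()   # position to mutate
    candidates = (-grad[pos]).topk(16).indices     # most promising replacements
    best = min(candidates, key=lambda c: target_loss(
        torch.cat([inj_ids[:pos], c.view(1), inj_ids[pos + 1:]])))
    inj_ids = torch.cat([inj_ids[:pos], best.view(1), inj_ids[pos + 1:]])
    if step % 10 == 0:
        print(f"step {step:3d}  loss {loss:.3f}  suffix {tok.decode(inj_ids)!r}")
```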
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Trustworthy LLM, Prompt Injection Attacks
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 578