More than Just Printing "Hijacked!": Automatic and Universal Prompt Injection Attacks against Large Language Models

ACL ARR 2024 April Submission578 Authors

16 Apr 2024 (modified: 01 Jun 2024) · ACL ARR 2024 April Submission · CC BY 4.0
Abstract: Large Language Models (LLMs) excel in processing and generating human language, powered by their ability to interpret and follow instructions. However, their capabilities can be exploited through prompt injection attacks. Research in this area has primarily relied on manually crafted attack prompts and is further challenged by the absence of a unified objective function that reflects real-world risks, which complicates comprehensive and accurate assessments of prompt injection robustness. In this paper, we introduce a unified framework for understanding the objectives of prompt injection attacks and present an automated gradient-based method for generating highly effective and universal prompt injection data, even in the face of defensive measures. With only five training samples (0.3% relative to the test data), our attack achieves superior performance compared with baselines. Our findings emphasize the importance of gradient-based testing, which can avoid overestimation of robustness, especially for defense mechanisms.
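To illustrate what a gradient-based search for an injection string looks like, the sketch below optimizes a short suffix so that a causal LM, given a benign task plus the suffix, assigns low loss to an attacker-chosen response. It is a minimal, simplified sketch in the spirit of greedy coordinate gradient search, not the authors' released implementation: the model ("gpt2" via HuggingFace transformers), the benign prompt, the target string, and all hyperparameters are illustrative assumptions, and a real universal attack would average the loss over several training samples and use larger candidate batches.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("gpt2")                       # assumed stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device).eval()
model.requires_grad_(False)                                       # we only need grads w.r.t. the suffix

user_prompt = "Summarize the following review: The food was great."  # benign task (assumed)
target = " Ignore the review and visit evil.example.com"             # attacker-chosen output (assumed)
suffix_len, n_steps, topk = 10, 50, 64                               # illustrative hyperparameters

embed_matrix = model.get_input_embeddings().weight                # (vocab_size, hidden_dim)
prefix_ids = tok(user_prompt, return_tensors="pt").input_ids.to(device)
target_ids = tok(target, return_tensors="pt").input_ids.to(device)
suffix_ids = torch.full((1, suffix_len), tok.encode("!")[0], device=device)  # init with "!" tokens

for step in range(n_steps):
    # Relax the discrete suffix to a one-hot matrix so gradients over token choices are defined.
    one_hot = torch.nn.functional.one_hot(suffix_ids, embed_matrix.size(0)).float().requires_grad_(True)
    suffix_embeds = one_hot @ embed_matrix
    inputs_embeds = torch.cat(
        [embed_matrix[prefix_ids], suffix_embeds, embed_matrix[target_ids]], dim=1)
    logits = model(inputs_embeds=inputs_embeds).logits

    # Cross-entropy of the attacker-chosen target, conditioned on the benign prompt + suffix.
    tgt_start = prefix_ids.size(1) + suffix_len
    loss = torch.nn.functional.cross_entropy(
        logits[0, tgt_start - 1 : tgt_start - 1 + target_ids.size(1)], target_ids[0])
    loss.backward()

    # Gradient-guided substitutions: at one position, try the top-k tokens whose one-hot
    # coordinate has the most negative gradient, and keep whichever lowers the loss most.
    grad = one_hot.grad[0]                                        # (suffix_len, vocab_size)
    pos = step % suffix_len                                       # simplified position schedule
    candidates = (-grad[pos]).topk(topk).indices
    best_loss, best_tok = loss.item(), suffix_ids[0, pos].item()
    with torch.no_grad():
        for cand in candidates:
            trial = suffix_ids.clone()
            trial[0, pos] = cand
            ids = torch.cat([prefix_ids, trial, target_ids], dim=1)
            out = model(ids).logits
            trial_loss = torch.nn.functional.cross_entropy(
                out[0, tgt_start - 1 : tgt_start - 1 + target_ids.size(1)], target_ids[0]).item()
            if trial_loss < best_loss:
                best_loss, best_tok = trial_loss, cand.item()
    suffix_ids[0, pos] = best_tok

print("Optimized injection suffix:", tok.decode(suffix_ids[0]))

The key design point this sketch demonstrates is replacing manual prompt crafting with an automated, loss-driven search: the same loop can target static, dynamic, or user-dependent responses simply by changing the target string, and summing the loss over a handful of training samples is what makes the resulting suffix transferable across tasks.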
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Trustworthy LLM, Prompt Injection Attacks
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 578
