FAT: A Prompt Injection Attack Utilizing Feign Security Agents with Deceptive One-shot Learning

ACL ARR 2025 February Submission 2839 Authors

15 Feb 2025 (modified: 09 May 2025) · ACL ARR 2025 February Submission · CC BY 4.0
Abstract:

The security of large language models (LLMs) has emerged as a critical research area in recent years. Despite their remarkable capabilities, LLMs remain inherently vulnerable to a range of security threats. To address these challenges, researchers have adopted techniques such as RLHF to promote ethical and responsible model behavior. However, this approach also introduces a potential risk: exposing LLMs to extensive security-related corpora during training may inadvertently foster over-reliance on, or blind trust in, security-related information. To investigate this issue, we propose a novel attack method termed FAT. By obfuscating malicious instructions, fabricating deceptive security claims, and leveraging crafted one-shot examples, the method manipulates LLMs into generating harmful content. To evaluate the effectiveness of the FAT attack, we introduce the FAT-Query dataset and conduct experiments on a variety of LLMs. The results demonstrate that mainstream models such as GPT-4o and DeepSeek-R1 are susceptible to the attack. Furthermore, we propose a defense mechanism based on DPO to mitigate the impact of FAT attacks. Experimental results show that DPO reduces the attack success rate on Llama-3.1-8B-Ins from 89.4% to 0.9%, substantially improving the model's robustness against such threats. These findings underscore the danger posed by the FAT attack and highlight the importance of scrutinizing the sources of security-related information used during LLM training. Moving away from uncritical reliance on security-related data is essential for developing more secure and reliable LLMs.
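For concreteness, below is a minimal sketch of a DPO-based defense of the kind described in the abstract, built on Hugging Face's TRL library. The checkpoint name, the preference file fat_preferences.jsonl, and all hyperparameters are illustrative assumptions, not the paper's actual training setup; argument names follow recent TRL releases (e.g., processing_class in place of the older tokenizer argument).

```python
# Sketch of a DPO defense against FAT-style prompts (assumed setup, not
# the paper's exact configuration). Each preference record is assumed to
# hold a FAT-style prompt, a safe refusal as "chosen", and the harmful
# completion as "rejected".
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Hypothetical preference data: {"prompt": ..., "chosen": ..., "rejected": ...}
train_dataset = load_dataset("json", data_files="fat_preferences.jsonl", split="train")

training_args = DPOConfig(
    output_dir="llama31-8b-fat-dpo",
    beta=0.1,                      # strength of the implicit KL penalty toward the reference model
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

trainer = DPOTrainer(
    model=model,                   # ref_model defaults to a frozen copy of this model
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
)
trainer.train()
```

Because each pair rewards a refusal over the completion elicited by a fabricated security claim, the optimization steers the model away from blind trust in such claims without retraining it from scratch.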

Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: reflections and critiques, data ethics, model bias/fairness evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment
Languages Studied: English
Submission Number: 2839