Abstract: Large Language Models (LLMs) are vulnerable to prompt injection attacks, where adversaries manipulate model behavior through malicious inputs. To mitigate these threats, prompt guard models have been introduced as lightweight defenses that filter inputs before they reach the LLM. However, their adversarial robustness remains largely unexplored. In this paper, we investigate the susceptibility of prompt guard models to adversarial attacks and introduce methods to enhance their resilience. We propose a novel adaptive attack, APGA, which jointly optimizes for bypassing prompt guard detection while inducing the LLM to generate targeted responses. Our attack achieves a 100% success rate across multiple guard models, exposing critical vulnerabilities. To counteract this threat, we introduce FEAT, a computationally efficient adversarial training method that leverages embedding-space perturbations to improve robustness without incurring high computational costs. Our empirical evaluation demonstrates that FEAT reduces the adversarial attack success rate from 100% to just 5% while preserving detection accuracy on clean inputs. Our findings highlight the urgent need for improved adversarial defenses in prompt guard models and establish a foundation for more secure LLM applications.
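To make the embedding-space idea mentioned in the abstract concrete, the sketch below shows a generic adversarial-training step that perturbs a guard classifier's input embeddings with a single FGSM-style gradient step. This is an illustrative assumption, not the paper's FEAT implementation; the backbone model name, `epsilon`, and optimizer settings are placeholders.

```python
# Generic sketch of embedding-space adversarial training for a binary
# prompt-guard classifier (FGSM-style perturbation of input embeddings).
# NOT the paper's FEAT; model name, epsilon, and learning rate are assumed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder guard backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
epsilon = 1e-2  # embedding perturbation budget (assumed)

def adversarial_training_step(texts, labels):
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    labels = torch.tensor(labels)

    # Run the embedding layer explicitly so we can take gradients with
    # respect to the embeddings themselves rather than discrete token ids.
    embeds = model.get_input_embeddings()(enc["input_ids"])
    embeds.retain_grad()
    clean_loss = model(inputs_embeds=embeds,
                       attention_mask=enc["attention_mask"],
                       labels=labels).loss
    clean_loss.backward()

    # Craft a worst-case (within epsilon) perturbation in embedding space.
    delta = epsilon * embeds.grad.detach().sign()

    # Discard the clean gradients and update on the adversarial loss only
    # (clean and adversarial losses could also be mixed).
    optimizer.zero_grad()
    adv_loss = model(inputs_embeds=embeds.detach() + delta,
                     attention_mask=enc["attention_mask"],
                     labels=labels).loss
    adv_loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return clean_loss.item(), adv_loss.item()
```

Because the perturbation is computed in continuous embedding space from gradients already needed for training, a step like this adds only one extra forward/backward pass, which is consistent with the abstract's emphasis on low computational cost.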
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Machine Learning for NLP, Ethics, Bias, and Fairness
Languages Studied: English
Submission Number: 4709