Abstract: Large Language Models (LLMs) are vulnerable to prompt injection attacks, where adversaries manipulate model behavior through malicious inputs. To mitigate these threats, prompt guard models have been introduced as lightweight defenses that filter inputs before they reach the LLM. However, their adversarial robustness remains largely unexplored. In this paper, we investigate the susceptibility of prompt guard models to adversarial attacks and introduce methods to enhance their resilience. We propose a novel adaptive attack, APGA, which jointly optimizes for bypassing prompt guard detection while inducing the LLM to generate targeted responses. Our attack achieves a 100% success rate across multiple guard models, exposing critical vulnerabilities. To counteract this threat, we introduce FEAT, a computationally efficient adversarial training method that leverages embedding-space perturbations to improve robustness without incurring high computational costs. Our empirical evaluation demonstrates that FEAT reduces the adversarial attack success rate from 100% to just 5% while preserving detection accuracy on clean inputs. Our findings highlight the urgent need for improved adversarial defenses in prompt guard models and establish a foundation for more secure LLM applications.
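To make the embedding-space idea mentioned in the abstract concrete, the sketch below shows a generic adversarial-training step that perturbs a guard classifier's input embeddings with a single FGSM-style gradient step. This is an illustrative assumption, not the paper's FEAT implementation; the backbone model name, `epsilon`, and optimizer settings are placeholders.

```python
# Generic sketch of embedding-space adversarial training for a binary
# prompt-guard classifier (FGSM-style perturbation of input embeddings).
# NOT the paper's FEAT; model name, epsilon, and learning rate are assumed.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder guard backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
epsilon = 1e-2  # embedding perturbation budget (assumed)

def adversarial_training_step(texts, labels):
    enc = tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
    labels = torch.tensor(labels)

    # Run the embedding layer explicitly so we can take gradients with
    # respect to the embeddings themselves rather than discrete token ids.
    embeds = model.get_input_embeddings()(enc["input_ids"])
    embeds.retain_grad()
    clean_loss = model(inputs_embeds=embeds,
                       attention_mask=enc["attention_mask"],
                       labels=labels).loss
    clean_loss.backward()

    # Craft a worst-case (within epsilon) perturbation in embedding space.
    delta = epsilon * embeds.grad.detach().sign()

    # Discard the clean gradients and update on the adversarial loss only
    # (clean and adversarial losses could also be mixed).
    optimizer.zero_grad()
    adv_loss = model(inputs_embeds=embeds.detach() + delta,
                     attention_mask=enc["attention_mask"],
                     labels=labels).loss
    adv_loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return clean_loss.item(), adv_loss.item()
```

Because the perturbation is computed in continuous embedding space from gradients already needed for training, a step like this adds only one extra forward/backward pass, which is consistent with the abstract's emphasis on low computational cost.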
Paper Type: Long
Research Area: Machine Learning for NLP
Research Area Keywords: Machine Learning for NLP, Ethics, Bias, and Fairness
Languages Studied: English
Submission Number: 4709