Track: Tiny Paper Track (between 2 and 4 pages)
Keywords: guardrail, prompt injection, security, LLM application
TL;DR: An intent-aware framework to benchmark prompt guard models on over-defense and difficult prompt attacks in LLM-based applications
Abstract: Prompt injection remains a major security risk for large language models, enabling adversarial manipulation through crafted inputs. Various prompt guardrail models have been developed to mitigate this threat, yet their efficacy in intent-aware adversarial settings remains underexplored, as existing defenses are typically evaluated only on static attack benchmarks. We introduce a novel intent-aware benchmarking framework that, from only a few contextual examples, diversifies adversarial inputs and assesses over-defense tendencies. Our experiments reveal that current prompt injection guardrail models suffer from high false negatives on adversarial cases and excessive false positives on benign inputs, highlighting critical limitations.
Submission Number: 136