Keywords: Adversarial Robustness, AI Coding Assistants, LLM, Security
TL;DR: Coding assistants can effectively red-team SoTA prompt injection defenses, and should be incorporated into evaluation before new defenses are proposed.
Abstract: Prompt injection is a critical security challenge for large language models (LLMs), yet proposed defenses are typically evaluated on toy benchmarks that fail to reflect real adversaries. We show that AI coding assistants, such as Claude Code, can act as automated red-teamers: they parse defense code, uncover hidden prompts and assumptions, and generate adaptive natural-language attacks. Evaluating three recent defenses -- DataSentinel, Melon, and DRIFT -- across standard and realistic benchmarks, we find that assistants extract defense logic and craft attacks that raise attack success rates by up to 50–60%. These results suggest coding assistants are not just productivity tools but practical adversarial collaborators, and that defenses should be tested against them before claims of robustness are made.
Submission Number: 142