Keywords: Adversarial Robustness, AI Coding Assistants, LLM, Security
TL;DR: Coding assistants can effectively red-team SoTA prompt injection defenses, and should be incorporated into evaluation before new defenses are proposed.
Abstract: Prompt injection is a critical security challenge for large language models (LLMs), yet proposed defenses are typically evaluated on toy benchmarks that fail to reflect real adversaries. We show that AI coding assistants, such as Claude Code, can act as automated red-teamers: they parse defense code, uncover hidden prompts and assumptions, and generate adaptive natural-language attacks. Evaluating three recent defenses -- DataSentinel, Melon, and DRIFT -- across standard and realistic benchmarks, we find that assistants extract defense logic and craft attacks that raise attack success rates by up to 50–60%. These results suggest coding assistants are not just productivity tools but practical adversarial collaborators, and that defenses should be tested against them before claims of robustness are made.
Submission Number: 142