Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

Shoumik Saha; Jifan Chen; Sam Mayers; Sanjay Krishna Gouda; Zijian Wang; Varun Kumar

Breaking the Code: Security Assessment of AI Code Agents Through Systematic Jailbreaking Attacks

Shoumik Saha, Jifan Chen, Sam Mayers, Sanjay Krishna Gouda, Zijian Wang, Varun Kumar

18 Sept 2025 (modified: 12 Feb 2026)ICLR 2026 Conference Desk Rejected SubmissionEveryoneRevisionsBibTeXCC BY 4.0

Keywords: ai, agent, code, jailbreak, security, attack, judge

TL;DR: We introduce CodeAgentJail-Bench -- spanning empty, single-file, and multi-file workspaces -- together with a hierarchical, executable-aware judge pipeline, showing code agents often compile and run malicious code.

Abstract: Code-capable large language model (LLM) agents are increasingly embedded into software engineering workflows where they can read, write, and execute code, raising the stakes of safety-bypass (“jailbreak”) attacks beyond text-only settings. Prior evaluations emphasize refusal or harmful-text detection, leaving open whether agents actually compile and run malicious programs. We present **CodeAgentJail**, a benchmark spanning three escalating workspace regimes that mirror attacker capability: empty (CAJ-0), single-file (CAJ-1), and multi-file (CAJ-M). We pair this with a hierarchical, executable-aware **Judge Framework** that tests (i) compliance, (ii) attack success, (iii) syntactic correctness, and (iv) runtime executability, moving beyond refusal to measure deployable harm. Using seven LLMs from five families as backends, we find that under prompt-only conditions in CAJ-0, code agents accept 61\% of attacks on average; 58\% are harmful, 52\% parse, and 27\% run end-to-end. Moving to single-file regime in CAJ-1 drives compliance to $\sim$100\% for capable models and yields a mean ASR (*Attack Success Rate*) $\approx$71\%; the multi-file regime (CAJ-M) raises mean ASR to $\approx$75\%, with 32\% being instantly deployable attack code. Across models, wrapping an LLM in an agent substantially increases vulnerability -- ASR raises by 1.6$\times$ -- by frequently overturning initial refusals during planning/tool-use steps. We further observe similar jailbreak trends when replacing OpenHands with SWE-Agent and OpenAI Codex, suggesting that CodeAgentJail is agent-agnostic. Category-level analyses identify which attack classes are most vulnerable and most readily deployable, while others exhibit large execution gaps. These findings motivate execution-aware defenses, code-contextual safety filters, and mechanisms that preserve refusal decisions throughout the agent’s multi-step reasoning and tool use.

Supplementary Material: zip

Primary Area: alignment, fairness, safety, privacy, and societal considerations

Submission Number: 14174

Loading