Weight-Level Defenses Improve LLM Agent Adversarial Robustness

Published: 03 Jun 2026, Last Modified: 03 Jun 2026, Venue: AI4GOOD Workshop 2026 Regular, Readers: Everyone, License: CC BY 4.0
Keywords: AI Safety, AI Agents, Adversarial Defense
TL;DR: ALICE is a representation-level LoRA defense for LLM agents that redirects the agent's completion representations back toward useful, on-task outputs under attack. On AgentDojo, it cuts mean ASR from 14.04% to 0.31%, and the same recipe works across five Llama/Qwen backbones.
Abstract: Large language model (LLM) agents are increasingly used to execute tools. This creates serious risks, because these agents operate over inputs from sources the user does not control: tool returns, retrieved documents, calendar invites. Indirect prompt injection (Greshake et al., 2023) describes the threat model in which attacker-crafted text in any of those channels redirects the agent to actions the user never requested. Existing defenses either filter inputs and rewrite prompts (brittle to format shift and surface obfuscation), train the model to refuse by default (collapsing utility under attack), or run a second model pass to detect divergence (extra inference cost, and over-firing on structured tool returns). To bridge this gap, we introduce ALICE (Activation LoRA for Intent Continuation under Exploitation), a representation-level LoRA trained on paired attacked/benign agent traces: rather than blocking or refusing, it redirects the model's completion-token activations back toward the user's original task whenever an injection is present. On AgentDojo's 13-attack, four-suite grid with Llama-3.3-70B, ALICE reduces the average attack success rate (ASR) from 14.04% to 0.31% at 1x inference cost while keeping under-attack utility within 2.2 pp of the undefended model (41.7% vs. 43.9%). Moreover, the ALICE recipe transfers across five additional Llama/Qwen backbones.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 422