Keywords: Monitoring, LLMs, AI Safety, Decomposition Attacks, Jailbreak, LLM Agents
TL;DR: We show that simple lightweight sequential monitors can effectively block decomposition attacks in LLM agents, achieving up to a 91% defense success rate-beating heavyweight LLMs as monitors, suggesting the practicality of our approach in deployment.
Abstract: As LLMs become more agentic, a critical risk emerges: attackers can decompose harmful goals into stateful, benign subtasks that trick LLM agents into executing them without realizing the harmful intent in the same context. The challenge lies in the existing shallow safety alignment techniques: they only detect harm in the immediate prompt and do not reason about long-range intent. We therefore propose adding an external monitor that observes the conversation at a higher level. To facilitate our study on monitoring decomposition attacks, we curate the largest and most diverse dataset, DecomposedHarm, with 4,634 tasks that can be assigned to LLM agents, including general agent tasks, text-to-image, and question-answering tasks, where each task has a benignly decomposed version. We verify our datasets by testing them on frontier models and show an 87\% attack success rate on average on GPT-4o. To defend in real‐time, we propose a lightweight sequential monitoring framework that cumulatively evaluates each sub‑prompt. We show that a carefully prompt-engineered lightweight monitor hits a 93\% defense success rate—outperforming strong baselines such as Llama-Guard-4 and o3-mini, while cutting costs by 90\% and latency by 50\%. Additionally, we show that even under adversarial pressure, combining decomposition attacks with massive random task injection and automated red teaming, our lightweight sequential monitors remain robust. Our findings suggest that guarding against stateful decomposition attacks is "surprisingly easy" with lightweight sequential monitors, enabling safety in real-world LLM agent deployment where expensive solutions are impractical.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 12596
Loading