UDora: A Unified Red Teaming Framework against LLM Agents by Dynamically Hijacking Their Own Reasoning

Published: 01 May 2025, Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We present UDora, a unified red teaming framework for LLM agents that dynamically hijacks the agent's own reasoning process to compel it toward malicious behavior.
Abstract: Large Language Model (LLM) agents equipped with external tools have become increasingly powerful for complex tasks such as web shopping, automated email replies, and financial trading. However, these advancements amplify the risks of adversarial attacks, especially when agents can access sensitive external functionalities. At the same time, manipulating LLM agents into performing targeted malicious actions or invoking specific tools remains challenging, as these agents extensively reason or plan before executing final actions. In this work, we present UDora, a unified red teaming framework for LLM agents that dynamically hijacks the agent's own reasoning process to compel malicious behavior. Specifically, UDora first generates the model's reasoning trace for the given task, then automatically identifies optimal points within this trace at which to insert targeted perturbations. The resulting perturbed reasoning is then used as a surrogate response for optimization. Applied iteratively, this process induces the LLM agent to undertake designated malicious actions or to invoke specific malicious tools. Our approach demonstrates superior effectiveness compared to existing methods across three LLM agent datasets. The code is available at https://github.com/AI-secure/UDora.
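The abstract outlines an iterative loop: generate the agent's reasoning trace, pick insertion points, splice in the target action as a surrogate response, and optimize the adversarial string against it. The following is a minimal conceptual sketch of that loop, not the released UDora implementation; all helper names (generate_trace, best_insertion_points, build_surrogate, optimize_suffix) are hypothetical placeholders.

```python
# Conceptual sketch of the iterative attack loop described in the abstract.
# The helpers below are hypothetical stand-ins, not the UDora API.

def generate_trace(agent, task, adv_suffix):
    """Run the agent on the task with the adversarial string appended
    and return its intermediate reasoning text."""
    return agent(task + " " + adv_suffix)

def best_insertion_points(trace, target_action, k=1):
    """Hypothetical scorer: choose position(s) in the reasoning trace where
    inserting the target (malicious) action is most natural for the model."""
    return [len(trace) // 2][:k]

def build_surrogate(trace, target_action, positions):
    """Splice the target action into the trace at the chosen positions,
    yielding the surrogate response used as the optimization target."""
    out = trace
    for p in sorted(positions, reverse=True):
        out = out[:p] + target_action + out[p:]
    return out

def optimize_suffix(agent, task, surrogate, adv_suffix):
    """Placeholder for a token-level optimizer (e.g. a GCG-style search)
    that updates the adversarial string so the agent reproduces the
    surrogate reasoning. Left as a no-op in this sketch."""
    return adv_suffix

def udora_attack(agent, task, target_action, n_iters=10, adv_suffix="! ! ! !"):
    for _ in range(n_iters):
        trace = generate_trace(agent, task, adv_suffix)             # step 1: reasoning trace
        if target_action in trace:                                  # stop once hijacked
            return adv_suffix
        points = best_insertion_points(trace, target_action)        # step 2: insertion points
        surrogate = build_surrogate(trace, target_action, points)   # step 3: surrogate response
        adv_suffix = optimize_suffix(agent, task, surrogate, adv_suffix)  # step 4: optimize
    return adv_suffix
```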
Lay Summary: Large Language Model (LLM) agents are powerful AI systems capable of executing real-world tasks, such as online shopping, automated email replies, and financial transactions. However, their advanced capabilities also expose them to greater risks from adversarial attacks, which aim to exploit the model’s reasoning to trigger malicious actions. In this work, we introduce a new framework called UDora designed specifically to test and identify vulnerabilities in these AI agents. Unlike previous methods that rely on fixed instructions or simple prompts, UDora dynamically leverages the agent's internal reasoning process. It first captures how the agent reasons through tasks, identifies strategic points within that reasoning to introduce subtle manipulations, and then optimizes these manipulations iteratively. By carefully inserting these targeted perturbations, UDora guides the agent towards performing unintended and potentially harmful actions, such as unauthorized email forwarding or inappropriate financial transactions. We demonstrate UDora's effectiveness across multiple realistic scenarios and datasets, achieving higher success rates than existing methods. Importantly, our approach highlights critical vulnerabilities in real-world AI systems and underscores the need for robust safeguards to ensure the secure and trustworthy deployment of powerful AI agents.
Link To Code: https://github.com/AI-secure/UDora
Primary Area: Deep Learning->Large Language Models
Keywords: adversarial attack; llm agent; jailbreak; reasoning and acting
Flagged For Ethics Review: true
Submission Number: 15515