IntentGuard: Safeguard LLM Agents via Intent Alignment

AAAI 2026 Workshop TrustAgent Submission 47 Authors

Published: 20 Nov 2025, Last Modified: 09 Mar 2026
Venue: AAAI 2026 TrustAgent Workshop Poster
Readers: Everyone
License: CC BY 4.0
Keywords: LLM Agents, Safety, Trusted Agents
TL;DR: Safeguard LLM Agents via Intent Alignment
Abstract: ReAct-like LLM agents, integrating reasoning and planning, can autonomously operate in external environments to accomplish complex tasks, unlocking vast new possibilities. Nonetheless, significant safety risks have emerged alongside these advancements. Unsafe agent behaviors, arising from model hallucinations or adversarial manipulation, may lead to severe consequences such as data leakage and financial loss. Existing safeguard mechanisms are mainly based on risk checking against a fixed set of static safety rules and are therefore ineffective when dealing with task-specific requirements or dynamic changes in external environments. In this study, we present IntentGuard, a novel runtime guardrail framework that is training-free and model-agnostic. IntentGuard maintains coherent agent intent through continuous intent alignment during task execution, incorporating two safety gates: the Plan Gate and the Tool Gate. These gates ensure the safety of the agent’s high-level plans and individual tool invocations, respectively. Through experiments on key benchmarks, IntentGuard demonstrates high effectiveness in detecting malicious attacks against LLM agents across diverse application domains, significantly and consistently outperforming existing baselines. Our code is available at https://anonymous.4open.science/r/agentdefense-CB92.
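The two-gate structure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the class `TwoGateGuard`, the methods `plan_gate` and `tool_gate`, and the `keyword_align` check are all hypothetical names introduced here. In particular, `keyword_align` is a toy word-overlap test standing in for whatever intent-alignment check IntentGuard actually performs, which the abstract does not specify.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


def _words(text: str) -> set:
    """Lowercased alphanumeric tokens longer than three characters."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if len(w) > 3}


def keyword_align(intent: str, action: str) -> bool:
    """Toy stand-in (assumption): an action is 'aligned' if it shares
    at least one content word with the stated user intent."""
    return bool(_words(intent) & _words(action))


@dataclass
class TwoGateGuard:
    """Hypothetical runtime guard with a plan-level and a tool-level check."""
    intent: str
    align: Callable[[str, str], bool] = keyword_align
    blocked: List[Tuple[str, str]] = field(default_factory=list)

    def plan_gate(self, plan: List[str]) -> bool:
        """Accept a high-level plan only if every step aligns with the intent."""
        for step in plan:
            if not self.align(self.intent, step):
                self.blocked.append(("plan", step))
                return False
        return True

    def tool_gate(self, tool_name: str, arguments: Dict[str, str]) -> bool:
        """Check a single tool invocation against the intent before it runs."""
        call = tool_name + " " + " ".join(f"{k} {v}" for k, v in arguments.items())
        if not self.align(self.intent, call):
            self.blocked.append(("tool", call))
            return False
        return True


guard = TwoGateGuard(intent="Summarize the quarterly sales report")
plan_ok = guard.plan_gate(["Open the sales report", "Summarize the report contents"])
read_ok = guard.tool_gate("read_file", {"path": "sales_report.txt"})
# A call that an injected instruction might trigger, unrelated to the user's intent.
send_ok = guard.tool_gate("send_email", {"to": "attacker@example.com", "body": "passwords"})
```

In this sketch the same intent string is checked at both levels, so a tool call introduced mid-task (for example by injected content) is compared against the original user request rather than against a fixed rule list. A real alignment check would need to be far stronger than word overlap.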
Submission Number: 47