TL;DR: We present ShieldAgent, which safeguards foundation model agents by enforcing policy compliance, and ShieldAgent-Bench, a dataset for evaluating guardrail performance across diverse real-world scenarios.
Abstract: Autonomous agents powered by foundation models have seen widespread adoption across various real-world applications. However, they remain highly vulnerable to malicious instructions and attacks, which can result in severe consequences such as privacy breaches and financial losses. More critically, existing guardrails for LLMs are not applicable due to the complex and dynamic nature of agents. To tackle these challenges, we propose ShieldAgent, the first guardrail agent designed to enforce explicit safety policy compliance for the action trajectory of other protected agents through logical reasoning. Specifically, ShieldAgent first constructs a safety policy model by extracting verifiable rules from policy documents and structuring them into a set of action-based probabilistic rule circuits. Given the action trajectory of the protected agent, ShieldAgent retrieves relevant rule circuits and generates a shielding plan, leveraging its comprehensive tool library and executable code for formal verification. In addition, given the lack of guardrail benchmarks for agents, we introduce ShieldAgent-Bench, a dataset with 3K safety-related pairs of agent instructions and action trajectories, collected via SOTA attacks across 6 web environments and 7 risk categories. Experiments show that ShieldAgent achieves SOTA on ShieldAgent-Bench and three existing benchmarks, outperforming prior methods by 11.3% on average with a high recall of 90.1%. Additionally, ShieldAgent reduces API queries by 64.7% and inference time by 58.2%, demonstrating its high precision and efficiency in safeguarding agents. Our project is available and continuously maintained here: https://shieldagent-aiguard.github.io/
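To make the abstract's pipeline concrete, below is a minimal, hypothetical sketch of the shielding loop it describes: rules extracted from policy documents are grouped into action-based rule circuits, and for each action in the protected agent's trajectory the relevant circuit is retrieved and checked. All names (`Rule`, `RuleCircuit`, `shield`) and the weighted-score aggregation are illustrative assumptions, not the authors' actual implementation, which relies on probabilistic rule circuits and formal verification tools.

```python
# Hypothetical illustration of an action-based rule-circuit guardrail loop.
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Rule:
    description: str
    check: Callable[[dict], bool]   # returns True if the action complies
    weight: float                   # relative importance of this rule


@dataclass
class RuleCircuit:
    action_type: str                # e.g. "click", "type", "purchase"
    rules: List[Rule]

    def compliance_score(self, action: dict) -> float:
        # Toy aggregation: weighted fraction of satisfied rules.
        total = sum(r.weight for r in self.rules) or 1.0
        satisfied = sum(r.weight for r in self.rules if r.check(action))
        return satisfied / total


def shield(trajectory: List[dict], circuits: Dict[str, RuleCircuit],
           threshold: float = 0.9) -> List[str]:
    """Return explanations for actions whose compliance score falls below threshold."""
    violations = []
    for step, action in enumerate(trajectory):
        circuit = circuits.get(action["type"])
        if circuit is None:
            continue  # no policy constraints registered for this action type
        if circuit.compliance_score(action) < threshold:
            failed = [r.description for r in circuit.rules if not r.check(action)]
            violations.append(f"step {step}: '{action['type']}' violates {failed}")
    return violations


if __name__ == "__main__":
    purchase_circuit = RuleCircuit(
        action_type="purchase",
        rules=[
            Rule("amount must not exceed the user-approved budget",
                 lambda a: a.get("amount", 0) <= a.get("budget", 0), weight=1.0),
            Rule("merchant must be on the allow-list",
                 lambda a: a.get("merchant") in a.get("allowed_merchants", []), weight=0.8),
        ],
    )
    trajectory = [{"type": "purchase", "amount": 500, "budget": 100,
                   "merchant": "unknown-shop", "allowed_merchants": ["trusted-shop"]}]
    print(shield(trajectory, {"purchase": purchase_circuit}))
```

In a real deployment the per-rule checks would be produced by the guardrail agent itself (e.g., as executable verification code), rather than hand-written as above.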
Lay Summary: As artificial intelligence (AI) agents become part of everyday technology, such as online assistants that help people shop, manage data, or answer questions, keeping these systems safe is increasingly important. However, many current AI “agents” are easily tricked by malicious commands and adversarial attacks, which can lead to data leaks, privacy risks, or financial losses. Existing safety tools often fail because they are designed for simple text models, not for complex, real-world AI agents that interact with websites and other systems over time.
Our work introduces ShieldAgent, a new AI safeguard for other AI agents. ShieldAgent watches the actions of these agents and checks them against explicit safety rules (such as government regulations or company policies) using advanced logic and verification tools. This means ShieldAgent doesn’t just look for bad words or obvious mistakes: it carefully reasons about what the agent is trying to do and whether it might break any important rules.
To test ShieldAgent, we built a large benchmark of tricky real-world situations, much larger and more realistic than previous tests. In experiments, ShieldAgent caught more unsafe behaviors in general AI agents than any prior guardrail system and did so more efficiently, saving both time and computing costs. By making AI agents safer and more trustworthy, ShieldAgent can help pave the way for reliable AI systems in daily life.
Application-Driven Machine Learning: This submission is on Application-Driven Machine Learning.
Link To Code: https://shieldagent-aiguard.github.io
Primary Area: Social Aspects->Safety
Keywords: LLM Agent Safety, LLM Guardrail Agent, Policy Compliance, Automated Logic Reasoning
Submission Number: 16287