GuardAgent: Safeguard LLM Agents via Knowledge-Enabled Reasoning

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: We propose GuardAgent, the first LLM agent as a guardrail to protect other LLM agents via knowledge-enabled reasoning.
Abstract: The rapid advancement of large language model (LLM) agents has raised new concerns about their safety and security. In this paper, we propose GuardAgent, the first guardrail agent to protect target agents by dynamically checking whether their actions satisfy given safety guard requests. Specifically, GuardAgent first analyzes the safety guard requests to generate a task plan, and then maps this plan into guardrail code for execution. By executing this code, GuardAgent can deterministically follow the safety guard request and safeguard the target agent. In both steps, an LLM serves as the reasoning component, supplemented by in-context demonstrations retrieved from a memory module that stores experiences from previous tasks. In addition, we propose two novel benchmarks: EICU-AC, which assesses access control for healthcare agents, and Mind2Web-SC, which evaluates safety policies for web agents. We show that GuardAgent effectively moderates violating actions of different types of agents on these two benchmarks, with guardrail accuracies of over 98% and 83%, respectively. Project page: https://guardagent.github.io/
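The abstract describes a two-step pipeline (request → task plan → guardrail code → execution, with memory-retrieved demonstrations). The following is a minimal sketch of that flow for orientation only; all names here (`Memory`, `retrieve_demonstrations`, `llm`, `guard`) are illustrative assumptions, not the authors' actual API.

```python
# Hypothetical sketch of the GuardAgent pipeline described in the abstract.
from dataclasses import dataclass, field


@dataclass
class Memory:
    """Stores (request, plan, code) records from previous guardrail tasks."""
    records: list = field(default_factory=list)

    def retrieve_demonstrations(self, request: str, k: int = 2) -> list:
        # Placeholder retrieval: return the k most recent records.
        # A real system would use similarity search over the request text.
        return self.records[-k:]


def llm(prompt: str) -> str:
    """Stand-in for an LLM call; replace with an actual API client."""
    raise NotImplementedError


def guard(request: str, target_agent_log: str, memory: Memory) -> bool:
    """Return True if the target agent's action is allowed, False if blocked."""
    demos = memory.retrieve_demonstrations(request)

    # Step 1: analyze the safety guard request into a task plan,
    # conditioning on retrieved in-context demonstrations.
    plan = llm(f"Demonstrations: {demos}\nRequest: {request}\nProduce a task plan.")

    # Step 2: map the plan into executable guardrail code.
    code = llm(f"Plan: {plan}\nAgent log: {target_agent_log}\nWrite guardrail code.")

    # Step 3: execute the generated code deterministically; assume it sets `allowed`.
    namespace: dict = {"log": target_agent_log}
    exec(code, namespace)  # in practice, run inside a sandboxed interpreter
    allowed = bool(namespace.get("allowed", False))

    # Record the experience for future retrieval.
    memory.records.append((request, plan, code))
    return allowed
```

Code execution in the final step is what makes the guardrail decision deterministic with respect to the safety guard request, rather than relying on a free-form LLM judgment.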
Lay Summary: As AI agents grow more powerful, they’re increasingly used in sensitive areas like healthcare and web automation. But how can we ensure they follow safety rules, such as avoiding private patient data or unsafe websites? Traditional safeguards fall short because they focus on filtering text, not regulating actions. We introduce GuardAgent, a system that interprets safety rules and generates code to monitor and block unsafe behavior. It uses a large language model and draws on past examples to adapt to new tasks. Tested in healthcare and web scenarios, GuardAgent achieved high accuracy and shows promise as a flexible, low-overhead safety layer for responsible AI use.
Link To Code: https://github.com/guardagent/code
Primary Area: Deep Learning->Large Language Models
Keywords: agent, large language model, guardrail, reasoning
Submission Number: 15081