Mind the Agent: A Comprehensive Survey on Large Language Model-Based Agent Safety

UIUC Spring 2025 CS598 LLM Agent Workshop Submission 2 Authors

16 Apr 2025 (modified: 18 Apr 2025) · UIUC Spring 2025 CS598 LLM Agent Workshop · CC BY 4.0
Keywords: LLM-based Agent, Safety, Survey
Abstract: The emergence of Large Language Model (LLM)-based agents represents a significant shift in AI systems—from passive language models to autonomous agents equipped with memory, tool-use capabilities, and long-horizon planning. While these agents unlock new possibilities across web automation, embodied robotics, and collaborative systems, they also introduce fundamentally novel safety risks that go beyond traditional LLM vulnerabilities. This survey provides a comprehensive overview of the growing field of LLM-based agent safety. We begin by contrasting LLM agents with standard LLMs, outlining how agent-specific capabilities amplify safety challenges such as execution-based harm, memory poisoning, and emergent failures in multi-agent collaboration. We categorize recent works into four major threat types—adversarial attacks, jailbreaking attacks, backdoor attacks, and multi-agent failures—and systematically examine how each exploits different stages of the agent pipeline. For each threat, we review proposed defense strategies, including robust training, prompt filtering, backdoor deactivation, and adversarial simulation. To evaluate these defenses, we survey the emerging landscape of agent safety benchmarks. We introduce a taxonomy based on attack surface, evaluation targets, and interaction complexity, and compare benchmark coverage across scenarios and models. Finally, we discuss open challenges and future directions, including dynamic and proactive safety evaluation, training-time alignment, and scalable defenses for real-world deployment. Our goal is to provide a structured foundation for advancing the safe and responsible development of LLM-based agents.
Submission Number: 2