Keywords: LLM Agents, Safety, Risks, Datasets, Benchmarks, Tool-Use, User Interactions, Frameworks
TL;DR: A novel framework to evaluate the safety and risks of LLMs when deployed as agents in workplace scenarios.
Abstract: Recent advances in LLM agents capable of solving complex, everyday tasks, ranging from software engineering to customer service, have enabled deployment in real-world scenarios, but their potential for unsafe behavior demands rigorous evaluation. While prior benchmarks have attempted to evaluate the safety of LLM agents, most fall short by relying on simulated environments, narrow task domains, or unrealistic tool abstractions. We introduce OpenAgentSafety, a comprehensive and modular framework for evaluating agent behavior across eight critical risk categories. Unlike prior work, our framework evaluates agents that interact with real tools, including a web browser, a code-execution environment, a file system, a bash terminal, and a messaging platform; and it supports over 350 multi-turn, multi-user tasks spanning both benign and adversarial user intents. OpenAgentSafety is designed for extensibility, allowing researchers to add tools, tasks, web environments, and adversarial strategies with minimal effort. It combines rule-based evaluation with LLM-as-judge assessments to detect both overt and subtle unsafe behaviors. Empirical analysis of seven prominent LLMs in agentic scenarios reveals unsafe behavior in 49% of safety-vulnerable tasks for Claude Sonnet 4 and up to 73% for o3-mini, highlighting critical risks and the need for stronger safeguards before real-world deployment of LLM agents.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21288