Track: Long Paper Track (up to 9 pages)
Keywords: Large Language Models, Language Model Agents, AI Safety, Evaluation
TL;DR: This paper develops a systematic safety evaluation framework for LM agents, reveals persistent gaps between risk awareness and safe execution, and proposes effective mitigation strategies.
Abstract: Language model (LM) agents have demonstrated significant potential for automating real-world tasks, yet they pose diverse and potentially severe risks in safety-critical scenarios. In this work, we identify a significant gap between LM agents' risk awareness and their safe-execution abilities: while they often answer "Yes" to queries like $\texttt{"Is executing `sudo rm -rf /*' dangerous?"}$, they frequently fail to identify such risks in instantiated trajectories, or even directly perform these risky actions when acting as agents. To systematically investigate this, we develop a comprehensive evaluation framework that examines agents' safety across three progressive dimensions: 1) their knowledge about potential risks, 2) their ability to identify corresponding risks in execution trajectories, and 3) their actual behavior in avoiding the execution of these risky actions. Our evaluation reveals two critical performance gaps that resemble the generator-validator gaps observed in LMs: while agents demonstrate near-perfect risk knowledge (>98% pass rates), they fail to apply this knowledge when identifying risks in actual scenarios, with performance dropping by >23%, and they often still execute risky actions (<26% pass rates). This trend persists even in specialized reasoning models such as DeepSeek-R1, underscoring the challenge of translating an LM's risk knowledge into safe decision-making. We leverage these observed gaps to develop a risk verifier that independently critiques the actions proposed by agents, paired with an abstractor that converts concrete execution trajectories into abstract descriptions in which LMs can identify risks more effectively. Our overall system reduces risky action execution by 55.3% relative to vanilla-prompted agents.
Submission Number: 113
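The mitigation system described in the abstract (an abstractor plus an independent risk verifier gating the agent's proposed actions) could be wired up roughly as in the sketch below. This is not the authors' implementation: the function `call_lm`, the prompt wording, and the YES/NO verdict parsing are all hypothetical placeholders to illustrate the control flow only.

```python
# Minimal sketch (assumptions, not the paper's code): guard an agent's proposed
# action with an abstractor + risk verifier before execution.

def call_lm(prompt: str) -> str:
    """Hypothetical stand-in for an LM backend; wire this to your model client."""
    raise NotImplementedError("replace with an actual language model call")

def abstract_trajectory(trajectory: list[str], proposed_action: str) -> str:
    """Abstractor: convert a concrete execution trajectory and proposed action
    into a high-level description, where risks tend to be easier to recognize."""
    prompt = (
        "Summarize the following agent trajectory and its proposed next action "
        "as an abstract description of what would happen:\n\n"
        + "\n".join(trajectory)
        + f"\n\nProposed action: {proposed_action}"
    )
    return call_lm(prompt)

def verify_action(abstract_description: str) -> bool:
    """Risk verifier: independently critique the abstracted action; return True
    only if it is judged safe to execute."""
    prompt = (
        "Does the following planned action carry a severe or irreversible risk "
        "(e.g., data loss, privilege escalation, financial harm)? Answer YES or NO.\n\n"
        + abstract_description
    )
    verdict = call_lm(prompt).strip().upper()
    return not verdict.startswith("YES")

def guarded_step(trajectory: list[str], proposed_action: str) -> str:
    """Execute the agent's proposed action only if the verifier passes it."""
    description = abstract_trajectory(trajectory, proposed_action)
    if verify_action(description):
        return proposed_action  # hand back to the executor unchanged
    return "ABORT: action flagged as risky by the verifier"
```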