TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories
Keywords: Tool-Calling, Agentic Safety, Guardrails, Trajectory-Level Assessment
TL;DR: We present TraceSafe, a systematic assessment of 13 LLM-as-a-guard models and 7 specialized guardrails. We find that a guardrail's efficacy in detecting mid-trajectory harm depends more on its structural data understanding than on its semantic safety alignment.
Abstract: As LLMs transition into autonomous agents, vulnerabilities shift from final outputs to intermediate execution traces. While safety guardrails are well-benchmarked for natural language responses, their efficacy remains unexplored within multi-step tool-use trajectories. To address this gap, we introduce **TraceSafe-Bench**, which encompasses 12 risk categories, ranging from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), and features over 1,000 unique execution instances. Our evaluation of 20 models yields three critical findings: 1) *Structural Bottleneck*: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than by semantic safety alignment. 2) *Architecture over Scale*: Model architecture influences risk detection performance more significantly than model size, with general-purpose LLMs outperforming specialized safety guardrails in trajectory analysis. 3) *Temporal Stability*: Accuracy remains resilient across longer-context trajectories. Our findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment to effectively mitigate mid-trajectory risks.
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 255