Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

Published: 01 Mar 2026, Last Modified: 24 Apr 2026, ICLR 2026 AIWILD, CC BY 4.0
Keywords: AI, LLM, Agent, Agent Safety, Agent Graph, Agentic AI, GPT-OSS-20B, Artificial Intelligence, AI Safety, AI Governance, Agent Jailbreak, Jailbreaking
TL;DR: AgentSeer enables deployment-aware jailbreak evaluation by tracing agent executions into action–component graphs to uncover “agentic-only” vulnerabilities missed by model-only testing, and it uses these traces to harden prompts.
Abstract: As large language models are increasingly deployed in agentic systems, existing methods face critical gaps in observing, assessing, and mitigating deployment-specific risks. We present a comprehensive, observability-driven workflow: we introduce \textbf{AgentSeer}, an observability tool that decomposes agentic executions into granular \emph{action--component} graphs; we use this decomposition to rigorously quantify the gap between model-level and agent-level jailbreaking risk via cross-model validation on GPT-OSS-20B and Gemini-2.0-flash with HarmBench under single-turn and iterative-refinement attacks; and we leverage action-graph risk signals to automate iterative prompt hardening against direct and iterative jailbreak attacks. The results reveal stark differences between model-level and agentic-level vulnerability profiles. Model-level evaluation reveals baseline differences: GPT-OSS-20B (39.47\% attack success rate, ASR) versus Gemini-2.0-flash (50.00\% ASR), with both models showing susceptibility to social engineering. However, agentic-level assessment exposes agent-specific risks invisible to traditional evaluation. We discover "agentic-only" vulnerabilities that emerge exclusively in agentic contexts, with tool-calling showing 24--60\% higher ASR across both models. Cross-model analysis reveals universal agentic patterns: agent transfer operations are the highest-risk tools, and the vulnerability mechanisms are semantic rather than syntactic. Direct attack transfer from model-level to agentic contexts shows degraded performance of successful prompts (GPT-OSS-20B: 57\% human injection ASR; Gemini-2.0-flash: 28\%), while context-aware iterative attacks successfully compromise objectives that failed at model-level, confirming systematic vulnerability gaps.
Action-based prompt improvement substantially reduces action-averaged agentic jailbreak success on GPT-OSS-20B (direct: 45.3\%$\rightarrow$8.2\%; iterative: 43.0\%$\rightarrow$8.0\%) and partially transfers to Gemini-2.0-flash for direct attacks (16.7\%$\rightarrow$6.4\%). These findings establish the urgent need for deployment-aware, agentic-situation evaluation paradigms, with AgentSeer providing a standardized methodology and empirical validation.
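
To make the reported hardening numbers easier to compare, the following is a minimal sketch of the arithmetic behind them. It is not the authors' code: the helper names (`asr`, `action_averaged_asr`, `reduction`) are hypothetical, and the unweighted mean over actions is only an assumed reading of "action-averaged", since the abstract does not define it. The before/after percentages are the ones quoted above.

```python
from typing import Iterable, Tuple


def asr(successes: int, attempts: int) -> float:
    """Attack success rate as a percentage of attempts."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return 100.0 * successes / attempts


def action_averaged_asr(per_action: Iterable[Tuple[int, int]]) -> float:
    """Unweighted mean of per-action ASRs.

    ASSUMPTION: each action in the graph contributes equally; the
    abstract does not state how the averaging is actually done.
    """
    rates = [asr(s, n) for s, n in per_action]
    if not rates:
        raise ValueError("need at least one action")
    return sum(rates) / len(rates)


def reduction(before: float, after: float) -> Tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in %)."""
    absolute = before - after
    relative = 100.0 * absolute / before
    return absolute, relative


# Before/after action-averaged ASR values quoted in the abstract.
reported = {
    "GPT-OSS-20B, direct": (45.3, 8.2),
    "GPT-OSS-20B, iterative": (43.0, 8.0),
    "Gemini-2.0-flash, direct": (16.7, 6.4),
}

if __name__ == "__main__":
    for setting, (before, after) in reported.items():
        absolute, relative = reduction(before, after)
        print(f"{setting}: -{absolute:.1f} points ({relative:.1f}% relative)")
```

Under these assumptions, the GPT-OSS-20B reductions are roughly 82% and 81% in relative terms, while the Gemini-2.0-flash transfer result is closer to 62%, which is consistent with the abstract's description of the transfer as partial.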
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 150