Tool-Framing Bypasses LLM Safety: Procedural Abstraction Reduces Refusal Rates by Up to 40 Percentage Points Across Models
Keywords: LLM Agent Safety, Adversarial Robustness, Red Teaming, Function Calling, Jailbreaking, Tool-Use Vulnerabilities, Procedural Abstraction, Trustworthy Machine Learning, Prompt Injection, Safety Evaluation, AI Security
Abstract: Large language models deployed as tool-using agents show sharply reduced safety refusals when harmful requests are reframed as tool operations. We call this "procedural abstraction": decomposing harmful intent into sequences of apparently legitimate API calls. Across three models (GPT-OSS-20b, DeepSeek-Chat, Claude Sonnet 4.5) and 1,040 matched prompt pairs spanning 13 harm categories, tool-framing reduces refusal rates by 10.9–40.2 percentage points under agent-primed conditions. GPT-OSS-20b shows the largest drop (85.2% → 45.0%), followed by DeepSeek-Chat (76.9% → 46.3%) and Claude Sonnet 4.5 (80.7% → 69.9%). The effect varies by harm type: operational harms like fraud and discrimination are highly susceptible (compliance rates exceeding 50%), while content-intrinsic harms like hate speech resist decomposition and maintain high refusal rates. The effect is statistically significant for every model (McNemar p < 0.001 after correction for multiple comparisons). To isolate tool-framing as the causal factor, we use a paired design: each prompt pair encodes identical harmful intent, differing only in whether the request is stated directly or expressed through tool invocations. We validate semantic equivalence via independent LLM-as-judge evaluation (100% seed-level, 98.9% expansion-level agreement). These results indicate that safety evaluations focused on plain-text chatbot interactions may systematically underestimate vulnerability in agentic deployments, and that tool-aware safety training is a necessary complement to existing approaches. We release gpt-oss-redteam, a reproducible pipeline for agentic red-teaming of local and API-based models.
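As a rough illustration of the paired analysis the abstract describes, the sketch below runs an exact two-sided McNemar test on paired refusal outcomes, where each pair records whether the model refused the direct prompt and whether it refused the tool-framed prompt. The function name `mcnemar_exact`, the toy data, and the choice of the exact binomial form are assumptions made for illustration; the abstract does not state which McNemar variant or which multiple-comparison correction was used, and this is not the gpt-oss-redteam implementation.

```python
from math import comb


def mcnemar_exact(pairs):
    """Exact two-sided McNemar test on paired binary outcomes.

    pairs: sequence of (refused_direct, refused_tool) booleans, one per
    matched prompt pair. Returns (b, c, p_value), where b counts pairs
    refused only when stated directly and c counts pairs refused only
    when tool-framed.
    """
    pairs = list(pairs)
    b = sum(1 for direct, tool in pairs if direct and not tool)
    c = sum(1 for direct, tool in pairs if tool and not direct)
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return b, c, 1.0
    # Under the null hypothesis, b ~ Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical toy data, not taken from the paper:
# 6 pairs refused only when direct, 3 refused both ways, 1 refused neither way.
toy_pairs = [(True, False)] * 6 + [(True, True)] * 3 + [(False, False)]
b, c, p = mcnemar_exact(toy_pairs)
print(b, c, p)  # 6 0 0.03125
```

Only the discordant pairs enter the statistic, which is why a paired design can detect a shift in refusal behavior even when most prompts are refused under both framings.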
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 140