ToolGuard: Red-Teaming Small Language Model Tool Calling on Consumer Hardware

TMLR Paper9195 Authors

25 May 2026 (modified: 19 Jun 2026), Under review for TMLR, CC BY 4.0

Abstract: Small language models (SLMs) are increasingly deployed for tool calling on edge devices and in agentic systems, yet their safety under adversarial conditions remains unstudied. Unlike text generation, tool calling creates a unique attack surface: a single malicious tool call can trigger irreversible real-world actions such as unauthorized financial transfers, data exfiltration, or privilege escalation. We present ToolGuard, to our knowledge the first systematic study of adversarial robustness in SLM tool calling. We contribute: (1) a taxonomy of five attack categories targeting tool-calling SLMs: parameter injection, tool substitution, privilege escalation, data exfiltration, and chain attacks; (2) ToolAttackBench, a benchmark of 50 adversarial prompts across 17 tool schemas in 5 domains; (3) an empirical red-team evaluation of four SLM families (1B–3B parameters) over 10 independent runs per prompt, revealing that capable SLMs exhibit attack success rates (ASR) of 47–52%, with a capability floor for tool-call vulnerability near 1–1.7B parameters; and (4) a runtime defense that enforces declarative security policies on completed tool calls, reducing mean ASR by 76% (from 48.9% to 11.7%) on the full benchmark and 78% (from 49.3% to 10.9%) on a held-out test set (Section 7.2), with 0% false positive rate on simulated canonical benign outputs (n=41; per-model FPR on actual model outputs not measured; 95% Wilson CI: [0%, 8.6%]) and sub-5 ms latency overhead. We find that tool substitution is the most dangerous attack category (65.4% mean ASR) but is effectively neutralized by intent verification (reduced to 5.0%), while data exfiltration proves the hardest to defend (44.3% → 31.7%) due to its reliance on semantically valid tool calls. An adaptive evaluation with 8 policy-aware attacks confirms that defense effectiveness drops against knowledgeable adversaries (19.0% defended ASR vs. 10.9% on held-out attacks), underscoring the need for learned defense components.
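The headline numbers in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper's code: the helper names `wilson_interval` and `relative_reduction` are hypothetical, and z = 1.96 is assumed for the 95% level. It computes the Wilson score interval for 0 false positives out of 41 benign outputs, and the relative ASR reductions implied by the before/after figures.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 assumed for 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


def relative_reduction(before: float, after: float) -> float:
    """Fractional drop from `before` to `after`, e.g. undefended vs. defended ASR."""
    return (before - after) / before


# False positive rate: 0 blocked calls among 41 benign outputs (from the abstract).
fpr_low, fpr_high = wilson_interval(0, 41)

# Relative ASR reductions on the full benchmark and the held-out set.
full_reduction = relative_reduction(48.9, 11.7)
heldout_reduction = relative_reduction(49.3, 10.9)

print(f"FPR 95% Wilson CI: [{fpr_low:.1%}, {fpr_high:.1%}]")
print(f"Full benchmark reduction: {full_reduction:.0%}")
print(f"Held-out reduction: {heldout_reduction:.0%}")
```

Under these assumptions the upper bound comes out at about 8.6% and the reductions at about 76% and 78%, matching the reported figures. The wide interval reflects the small benign sample (n=41), which is why a 0% observed false positive rate does not rule out a rate of several percent.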

Submission Type: Long submission (more than 12 pages of main content)

Assigned Action Editor: ~Kamalika_Chaudhuri1

Submission Number: 9195
