Unsafe Only in Combination: Interaction-Barrier Shielding for Tool-Using LLM Agents

Published: 27 May 2026, Last Modified: 27 May 2026
Venue: CompLearn 2026 Poster
License: CC BY 4.0
Keywords: compositional learning, LLM agent safety, prompt injection defense, tool-using agents, capability interaction modeling
TL;DR: We show that many agent attacks are unsafe only in capability combinations, and introduce an interaction-barrier shield that blocks risky compositions while preserving more task utility than flow-only defenses.
Abstract: Tool-using LLM agents fail in ways that are poorly captured by atomic safety tests. In indirect prompt injection, an untrusted message, a private database, and an outbound tool may each be useful in isolation, yet their composition enables exfiltration or unauthorized action. We introduce CCL-Bench, a capability-lattice evaluation protocol that converts existing agent-security benchmarks into matched counterfactual variants. The protocol estimates higher-order safety interactions such as the superadditive risk of combining untrusted context, private data, and external sinks. We then propose CAPSPLIT-IB, an interaction-barrier shield that models unsafe capability compositions as weighted hyperedges and repairs unsafe tool calls through a constrained minimum-cost intervention. Across AgentDojo, InjecAgent, and held-out AgentDyn tasks, CAPSPLIT-IB reduces the attack success rate (ASR) from 27.03% (NoDefense) to 1.18% in the main aggregate, while maintaining low ODR among defended methods and improving context-dependent utility over flow-based baselines. These results support a compositional view of agent safety: useful capabilities can be safe alone but unsafe in combination.
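The notion of a higher-order interaction over a capability lattice can be illustrated with a minimal sketch. It assumes the interaction is measured by inclusion-exclusion (Möbius inversion) over attack-success rates of matched capability subsets. The abstract does not specify the estimator CCL-Bench uses, so the function name `interaction`, the capability labels, and all rates below are hypothetical and purely illustrative.

```python
from itertools import combinations

def interaction(risk, caps):
    """Highest-order interaction term over `caps`, computed by
    inclusion-exclusion: sum over subsets T of caps of
    (-1)^(|caps| - |T|) * risk[T]."""
    caps = tuple(caps)
    total = 0.0
    for k in range(len(caps) + 1):
        sign = (-1) ** (len(caps) - k)
        for subset in combinations(caps, k):
            total += sign * risk[frozenset(subset)]
    return total

# Hypothetical capability labels (not taken from the paper's code).
U, P, S = "untrusted_context", "private_data", "external_sink"

# Hypothetical attack-success rates for each matched counterfactual
# variant; these are made-up numbers, not results from the paper.
risk = {
    frozenset(): 0.0,
    frozenset({U}): 0.0,
    frozenset({P}): 0.0,
    frozenset({S}): 0.0,
    frozenset({U, P}): 0.0,
    frozenset({U, S}): 0.125,
    frozenset({P, S}): 0.0,
    frozenset({U, P, S}): 0.5,
}

# Risk attributable only to the three-way combination.
third_order = interaction(risk, (U, P, S))
# Risk attributable to the pair (untrusted context, external sink).
pair_us = interaction(risk, (U, S))
```

A positive `third_order` means the full combination carries more risk than its singletons and pairs account for, which is the superadditive pattern the abstract describes.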
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 194