Keywords: Agentic Systems, Certification
Abstract: Large language models (LLMs) are increasingly deployed in agentic systems where they map user intents to relevant external tools to fulfill a task. A critical step in this process is tool selection: a retriever first surfaces a top-N slate of candidate tools from a large pool, after which the LLM selects the most appropriate one. This pipeline presents an underexplored attack surface, where errors in selection can lead to severe outcomes such as unauthorized data access or denial of service, all without modifying the agent's model or code. While existing evaluations measure task performance in benign settings, they overlook the specific vulnerabilities of the tool selection mechanism under adversarial conditions.
To address this gap, we introduce Certification of Agentic Tool Selection (CATS), the first statistical framework that formally certifies tool selection robustness. CATS models tool selection as a Bernoulli success process and evaluates it against a strong, adaptive attacker who introduces adversarial tools with misleading metadata that are iteratively refined based on the agent's previous choices. By sampling these adversarial interactions, CATS produces a high-confidence lower bound on selection accuracy, formally quantifying the agent's worst-case performance. Our evaluation with CATS uncovers the severe fragility of state-of-the-art LLM agents in tool selection. Under attacks that inject deceptively appealing tools or saturate retrieval results, the certified lower bound on accuracy drops close to zero, an average performance drop of over 60% compared to non-adversarial settings. For attacks targeting the retrieval and selection stages, the certified accuracy bound plummets to less than 20% after just a single round of adversarial adaptation. CATS thus reveals previously unexamined security threats inherent to tool selection and provides a principled method to quantify an agent's robustness to such threats, a necessary step for the safe deployment of agentic systems.
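Since the abstract does not name the estimator behind the "high-confidence lower bound" on the Bernoulli success rate, the following is a minimal sketch assuming a one-sided Clopper-Pearson (exact binomial) lower confidence bound; the function name and parameters are illustrative, not the paper's implementation.

```python
from scipy.stats import beta

def certified_lower_bound(successes: int, trials: int, alpha: float = 0.05) -> float:
    """One-sided Clopper-Pearson lower confidence bound on a Bernoulli
    success probability; holds with probability >= 1 - alpha.

    Assumed setup: each sampled adversarial interaction is an i.i.d.
    Bernoulli trial that succeeds iff the agent selects the correct tool.
    """
    if successes == 0:
        return 0.0  # exact lower bound degenerates to 0 with no successes
    # The lower bound is the alpha-quantile of Beta(k, n - k + 1).
    return float(beta.ppf(alpha, successes, trials - successes + 1))

# Hypothetical numbers: 70 correct selections in 100 adversarial trials,
# at 95% confidence the certified accuracy is roughly 0.62.
print(certified_lower_bound(70, 100))
```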
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 21962