ATLAS: Adaptive Topology-Level Attack Synthesis for Probing Multi-Agent Systems
Keywords: AI Security, Multi-agent Adversarial Attack, AI Red Teaming
TL;DR: ATLAS is a topology-aware adversarial attack that probes multi-agent AI systems for security vulnerabilities with 47% fewer queries than the strongest prior baseline.
Abstract: Production multi-agent AI systems delegate tasks, share memory, and call tools across trust boundaries, yet most existing adversarial attacks evaluate one agent at a time. We present ATLAS, an attack algorithm designed for multi-agent topologies. ATLAS profiles defenses across six dimensions, routes attacks through eight structural modes via an MDP with Bellman value propagation, and recovers verbally compliant but unexecuted attacks by retrying through alternative delegation paths. On 25 objectives mapped to the OWASP Top 10 Agentic Risks, across four enterprise scenarios (FinOps, DevSecOps, Healthcare, SOC) and four target models, ATLAS achieves 85.9% attack success rate (ASR) with 5.7 queries per objective, compared with GOAT (82.5%, 10.7q), TAP (75.3%, 9.3q), and Crescendo (63.1%, 10.7q): a 47% reduction in attack budget. All results are scored with tool execution evidence rather than verbal compliance. Across ten models we observe a capability–security inversion: GPT-4.1 (62.9%) is markedly less secure than GPT-4o-mini (38.6%), while Claude Opus 4.6 resists 99.3% of attacks.
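The 47% budget figure can be checked directly from the queries-per-objective numbers reported in the abstract. The short sketch below is only an arithmetic check, not part of ATLAS; the helper name `relative_reduction` is an illustrative assumption, and the reading that the 47% is measured against GOAT (the baseline with the highest ASR) is inferred from the TL;DR.

```python
# Queries per objective, copied from the abstract.
queries = {"ATLAS": 5.7, "GOAT": 10.7, "TAP": 9.3, "Crescendo": 10.7}


def relative_reduction(baseline: float, ours: float) -> float:
    """Fraction of the baseline query budget saved (hypothetical helper)."""
    return (baseline - ours) / baseline


# Reduction of ATLAS relative to each baseline attack.
reductions = {
    name: relative_reduction(q, queries["ATLAS"])
    for name, q in queries.items()
    if name != "ATLAS"
}

for name, r in reductions.items():
    print(f"{name}: {r:.1%} fewer queries")
```

Against GOAT and Crescendo the saving is 5.0 / 10.7, which rounds to 47%; against TAP it is smaller, since TAP already uses fewer queries than the other two baselines.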
Track: Short Paper (4 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 202