Visual Exclusivity Attacks: Automatic Multimodal Red Teaming via Agentic Planning
Keywords: red teaming, jailbreak attack, MLLMs, benchmark
TL;DR: We jailbreak VLMs by planning multi-turn attacks around visual content where harm requires reasoning over images, not embedded text.
Abstract: Current multimodal red teaming treats images as wrappers for malicious payloads delivered via typography or adversarial noise. These attacks are structurally brittle, because standard defenses neutralize them once the payload is exposed. We introduce Visual Exclusivity (VE), a more resilient Image-as-Basis threat in which harm emerges only through reasoning over visual content such as technical schematics. To systematically exploit VE, we propose Multimodal Multi-turn Agentic Planning (MM-Plan), which reframes jailbreaking from turn-by-turn reaction to global plan synthesis. MM-Plan trains an attacker planner with Group Relative Policy Optimization (GRPO), enabling it to discover effective strategies without human supervision. We also introduce VE-Safety, a human-curated dataset of 440 instances spanning 15 safety categories. MM-Plan achieves a 46.3% attack success rate against Claude 4.5 Sonnet and 13.8% against GPT-5, outperforming baselines by 2--5$\times$. Warning: This paper contains potentially harmful content.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 116