Keywords: Prompt Injection, Red teaming, Multimodal Security
TL;DR: Our multi-turn red-team pipeline bypasses VLMs' image safeguards, exposing a rapidly closing patch gap and highlighting the value of shared red-teaming forums.
Abstract: Multimodal jailbreaks in vision–language systems unfold through multi-turn interactions and modality shifts, and frequent interface and policy updates make fixed benchmarks quickly obsolete. We adopt a continuous adversarial auditing stance for image harms using two simple frames. Pre-update, a Setup–Insistence–Override (SIO) escalation primes helpfulness with benign context and example images, requires image-only output, and then overrides residual disclaimers; in an April 2025 GPT-4o case window, SIO yielded 18/33 unsafe images overall. After updates, we use a Caption-Relay Loop (CRL) that proceeds from a public seed image to a bounded factual caption (300–400 words; no opinions or slurs) and then to image-only generation in a fresh, zero-shot session. Under a consistent safe/benign/unsafe rubric, post-update outcomes trend toward refusals or benign images. To date, CRL has produced at least 25 unsafe images across five harm categories spanning three production VLMs (GPT-4o, Gemini 2.5 Flash, Mistral). These observations show that defenses evolve rapidly even as new red-teaming approaches continue to emerge.
Submission Number: 29