Track: tiny paper (up to 4 pages)
Keywords: Visual jailbreak attacks, Loss landscape geometry, Multimodal large language model, Diverse adversarial examples, Transferable attacks
Abstract: Multimodal large language models (MLLMs) are increasingly vulnerable to visual jailbreak attacks, yet existing methods are often brittle, lack diversity, and transfer poorly across architectures. We revisit these limitations from a geometric perspective and introduce \emph{Jailbreak Connectivity (JC)}, which models effective jailbreaks as connected regions of low adversarial loss rather than isolated images. By explicitly constructing continuous paths in the image space, JC generates diverse jailbreaks and exposes structural properties of multimodal vulnerabilities. We further incorporate lightweight surrogate guidance to improve cross-model transferability. Experiments on SafetyBench show that JC substantially outperforms prior methods, achieving an average attack success rate (ASR) of \emph{79.62\%} (\emph{+36.24\%}) and the lowest perplexity (PPL) in most settings. Our results demonstrate the value of connectivity-based analysis for understanding and exploiting visual jailbreak behaviors in MLLMs. Warning: This paper contains data, prompts, and model outputs that are offensive in nature.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 22