Abstract: Multimodal large language models unify visual perception with natural language understanding, yet they remain vulnerable to adversarial manipulation. Existing jailbreak attacks exploit vision-text vulnerabilities through pixel-space perturbations and prompt optimization, overlooking a more fundamental weakness: the modality gap, the geometric separation between image and text embeddings. We present Adaptive Modality Gap Exploitation (AMGE), an attack framework that operates within the embedding manifold through gap-aware perturbation optimization and cross-attention-mediated gradient flow. Our framework characterizes the modality gap via empirical directional bias estimation, formulates attacks as geometric exploitation in which gradient updates align with gap vectors, and employs momentum-based ensemble aggregation for universal transferability across queries and architectures. Evaluation on four multimodal LLMs (LLaVA-1.5-7B/13B, Qwen-VL, Qwen2-VL) demonstrates a 90.2% attack success rate with 79.1% transferability, requiring only 127 queries (3× fewer than competing methods) while maintaining 87.5% semantic preservation. AMGE sustains 62.3% effectiveness against five defenses, outperforming existing attacks by 23.7%. This work establishes embedding-space geometric exploitation as a principled paradigm for exposing vulnerabilities in multimodal alignment architectures.
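The abstract names three ingredients: estimating the modality gap as an empirical directional bias, aligning gradient updates with the gap vector, and smoothing updates with momentum for transferability. The sketch below is a rough PyTorch illustration of how these pieces could fit together under stated assumptions; it is not AMGE's actual algorithm (the paper's implementation details are not given here), and all names and hyperparameters (`image_encoder`, `lam`, `mu`, step sizes) are hypothetical.

```python
import torch
import torch.nn.functional as F

def estimate_gap_direction(image_embeds: torch.Tensor,
                           text_embeds: torch.Tensor) -> torch.Tensor:
    # Empirical directional bias: unit vector from the image-embedding
    # centroid toward the text-embedding centroid.
    gap = text_embeds.mean(dim=0) - image_embeds.mean(dim=0)
    return F.normalize(gap, dim=0)

def gap_aligned_attack(image, image_encoder, target_text_embed, gap_dir,
                       steps=100, alpha=1.0 / 255, eps=8.0 / 255,
                       lam=0.5, mu=0.9):
    """L-inf PGD sketch: the loss pulls the image embedding toward a
    target text embedding and rewards displacement parallel to the
    modality-gap direction; a momentum buffer (mu) smooths gradients,
    e.g. when accumulated over an ensemble of queries or surrogates."""
    clean_embed = image_encoder(image).detach()
    delta = torch.zeros_like(image, requires_grad=True)
    momentum = torch.zeros_like(image)
    for _ in range(steps):
        adv_embed = image_encoder(image + delta)
        # Term 1: pull the adversarial embedding toward the target text.
        attract = 1 - F.cosine_similarity(
            adv_embed, target_text_embed, dim=-1).mean()
        # Term 2: reward embedding displacement aligned with the gap vector.
        shift = F.normalize(adv_embed - clean_embed, dim=-1)
        align = -(shift @ gap_dir).mean()
        loss = attract + lam * align
        grad, = torch.autograd.grad(loss, delta)
        # MI-FGSM-style momentum accumulation for transferability.
        momentum = mu * momentum + grad / grad.abs().mean().clamp_min(1e-12)
        with torch.no_grad():
            delta -= alpha * momentum.sign()
            delta.clamp_(-eps, eps)
    return (image + delta).detach()
```

With CLIP-style encoders, `gap_dir` would plausibly be estimated once from a held-out batch of paired image and text embeddings and then reused across attacks, which is what makes the direction query-independent.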
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Jinghui_Chen1
Submission Number: 6329