Defending Multimodal Large Language Models Against Jailbreak Attacks Through Embedding-Space Adversarial Smoothing
Abstract: Multimodal large language models have achieved unprecedented capabilities in integrating visual perception with natural language understanding. However, jailbreak attacks exploit coordinated vision-text manipulations through typographic prompts, pictorial code, and multi-modal linkage, achieving attack success rates (ASR) exceeding 90%. We introduce Embedding-Space Adversarial Smoothing (ESAS), a defense that operates directly on the embedding manifold through cross-modal coupled interpolation, contrastive safety anchoring, and a lightweight adapter transformation. Our framework synthesizes adversarial embeddings via gradient-based visual perturbations and text suffix injection, applies Beta-distributed mixing to produce smoothed manifold trajectories, and leverages safety anchors that attract embeddings toward safe regions while repelling them from adversarial zones. Evaluation across seven attacks and four architectures demonstrates 78.8% attack mitigation, reducing ASR from 79.2% to 16.8% with only a 0.6% accuracy drop. ESAS outperforms four state-of-the-art defenses, maintaining ASR below 20% under perturbations up to ε = 0.15. This work establishes embedding-space geometric regularization as a principled paradigm for defending multimodal systems against cross-modal jailbreak threats.
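Since the abstract only names the mechanisms, the following PyTorch sketch illustrates how the described pieces could fit together: PGD-style gradient perturbation of an embedding, Beta-distributed mixing of clean and adversarial embeddings, and a contrastive loss that attracts the mix toward safe anchors while repelling known adversarial zones. All function names, hyperparameters (alpha, beta, tau, margin, step sizes), and the attack objective `loss_fn` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def make_adversarial(embed, loss_fn, steps=3, step_size=0.01, eps=0.15):
    """Gradient-based perturbation of a (visual) embedding, PGD-style.

    `loss_fn` maps an embedding to a scalar jailbreak objective; the
    concrete objective is an assumption here, not taken from the paper.
    """
    adv = embed.detach().clone().requires_grad_(True)
    for _ in range(steps):
        loss = loss_fn(adv)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv += step_size * grad.sign()                 # ascend the attack objective
            adv = embed + (adv - embed).clamp(-eps, eps)   # project back into the eps-ball
        adv.requires_grad_(True)
    return adv.detach()

def esas_loss(clean, adv, safe_anchors, adv_anchors,
              alpha=2.0, beta=2.0, tau=0.1, margin=0.5):
    """Beta-distributed mixing plus contrastive safety anchoring (sketch).

    clean, adv:    (B, D) clean / adversarial embeddings
    safe_anchors:  (K, D) embeddings of safe regions
    adv_anchors:   (M, D) embeddings of known adversarial zones
    """
    # Beta-distributed interpolation coefficient -> smoothed manifold trajectory
    lam = torch.distributions.Beta(alpha, beta).sample(
        (clean.size(0), 1)).to(clean.device)
    mixed = lam * clean + (1.0 - lam) * adv

    mixed = F.normalize(mixed, dim=-1)
    safe = F.normalize(safe_anchors, dim=-1)
    bad = F.normalize(adv_anchors, dim=-1)

    # Attract each mixed embedding toward its nearest safe anchor ...
    sim_safe = (mixed @ safe.T) / tau            # (B, K) scaled cosine similarities
    attract = -sim_safe.max(dim=-1).values       # maximize best safe similarity

    # ... while repelling it from adversarial anchor zones (hinge penalty)
    sim_bad = (mixed @ bad.T) / tau              # (B, M)
    repel = F.relu(sim_bad.max(dim=-1).values - margin)

    return (attract + repel).mean()
```

In this reading, the Beta mixing supplies training points along the clean-to-adversarial trajectory rather than only at its endpoints, which is what would make the learned adapter's behavior smooth over the region an attacker can reach; the max-over-anchors terms are one plausible choice among several for "attract" and "repel".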
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Chao_Chen1
Submission Number: 6220