Defending Multimodal Large Language Models Against Jailbreak Attacks Through Embedding-Space Adversarial Smoothing
Abstract: Multimodal large language models have achieved unprecedented capabilities in integrating visual perception with natural language understanding. However, jailbreak attacks exploit coordinated vision-text manipulations through typographic prompts, pictorial code, and multi-modal linkage, achieving attack success rates (ASR) exceeding 90%. We introduce Embedding-Space Adversarial Smoothing (ESAS), a defense that operates directly on the embedding manifold through cross-modal coupled interpolation, contrastive safety anchoring, and a lightweight adapter transformation. Our framework synthesizes adversarial embeddings via gradient-based visual perturbations and text suffix injection, applies Beta-distributed mixing to produce smoothed manifold trajectories, and leverages safety anchors that attract embeddings toward safe regions while repelling them from adversarial zones. Evaluation across seven attacks and four architectures demonstrates 78.8% attack mitigation, reducing ASR from 79.2% to 16.8% with only a 0.6% accuracy drop. ESAS outperforms four state-of-the-art defenses, maintaining ASR below 20% under perturbations up to ε = 0.15. This work establishes embedding-space geometric regularization as a principled paradigm for defending multimodal systems against cross-modal jailbreak threats.
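Since the abstract only names the mechanisms, the following PyTorch sketch illustrates how the described pieces could fit together: PGD-style gradient perturbation of an embedding, Beta-distributed mixing of clean and adversarial embeddings, and a contrastive loss that attracts the mix toward safe anchors while repelling known adversarial zones. All function names, hyperparameters (alpha, beta, tau, margin, step sizes), and the attack objective `loss_fn` are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def make_adversarial(embed, loss_fn, steps=3, step_size=0.01, eps=0.15):
    """Gradient-based perturbation of a (visual) embedding, PGD-style.

    `loss_fn` maps an embedding to a scalar jailbreak objective; the
    concrete objective is an assumption here, not taken from the paper.
    """
    adv = embed.detach().clone().requires_grad_(True)
    for _ in range(steps):
        loss = loss_fn(adv)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv += step_size * grad.sign()                 # ascend the attack objective
            adv = embed + (adv - embed).clamp(-eps, eps)   # project back into the eps-ball
        adv.requires_grad_(True)
    return adv.detach()

def esas_loss(clean, adv, safe_anchors, adv_anchors,
              alpha=2.0, beta=2.0, tau=0.1, margin=0.5):
    """Beta-distributed mixing plus contrastive safety anchoring (sketch).

    clean, adv:    (B, D) clean / adversarial embeddings
    safe_anchors:  (K, D) embeddings of safe regions
    adv_anchors:   (M, D) embeddings of known adversarial zones
    """
    # Beta-distributed interpolation coefficient -> smoothed manifold trajectory
    lam = torch.distributions.Beta(alpha, beta).sample(
        (clean.size(0), 1)).to(clean.device)
    mixed = lam * clean + (1.0 - lam) * adv

    mixed = F.normalize(mixed, dim=-1)
    safe = F.normalize(safe_anchors, dim=-1)
    bad = F.normalize(adv_anchors, dim=-1)

    # Attract each mixed embedding toward its nearest safe anchor ...
    sim_safe = (mixed @ safe.T) / tau            # (B, K) scaled cosine similarities
    attract = -sim_safe.max(dim=-1).values       # maximize best safe similarity

    # ... while repelling it from adversarial anchor zones (hinge penalty)
    sim_bad = (mixed @ bad.T) / tau              # (B, M)
    repel = F.relu(sim_bad.max(dim=-1).values - margin)

    return (attract + repel).mean()
```

In this reading, the Beta mixing supplies training points along the clean-to-adversarial trajectory rather than only at its endpoints, which is what would make the learned adapter's behavior smooth over the region an attacker can reach; the max-over-anchors terms are one plausible choice among several for "attract" and "repel".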
Submission Type: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Chao_Chen1
Submission Number: 6220