Abstract: Recent advancements in Large Vision-Language Models (VLMs) have underscored their superiority in various multimodal tasks. However, the adversarial robustness of VLMs has not been fully explored. Existing methodologies mainly assess robustness through unimodal adversarial attacks that perturb images, while assuming inherent resilience against text-based attacks. In contrast, our methodology adopts a comprehensive strategy that jointly attacks both text and image modalities to exploit a broader spectrum of vulnerabilities within VLMs. Furthermore, we propose a dual optimization objective aimed at guiding the model to generate affirmative responses with high toxicity. Specifically, we begin by optimizing an adversarial image prefix from random noise to generate diverse harmful responses in the absence of text input, thus imbuing the image with toxic semantics. Subsequently, an adversarial text suffix is integrated and co-optimized with the adversarial image prefix to maximize the probability of eliciting affirmative responses to various harmful instructions. The discovered adversarial image prefix and text suffix are collectively denoted as a Universal Master Key (UMK). When integrated into various malicious queries, UMK can circumvent the alignment defenses of VLMs and lead to the generation of objectionable content, a phenomenon known as jailbreaking. The experimental results demonstrate that our universal attack strategy can effectively jailbreak MiniGPT-4 with a 96% success rate, highlighting the fragility of VLMs and the urgent need for new alignment strategies.
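The sketch below illustrates the two-stage optimization described in the abstract under stated assumptions; it is not the paper's implementation. The function `toy_vlm_loss` is a hypothetical, differentiable stand-in for the frozen victim VLM's (e.g., MiniGPT-4's) language-modeling loss on a harmful or affirmative target, and the discrete suffix update is reduced to a random token swap rather than a gradient-guided token search.

```python
# Minimal sketch of the two-stage UMK optimization (illustrative only).
import torch

IMG_SHAPE = (1, 3, 224, 224)   # assumed input resolution of the victim VLM
VOCAB, SUFFIX_LEN = 32000, 20  # assumed vocabulary size and suffix length

def toy_vlm_loss(adv_image: torch.Tensor, suffix: torch.Tensor) -> torch.Tensor:
    """Hypothetical stand-in for the VLM's cross-entropy loss of generating a
    harmful/affirmative target given the image prefix and text suffix."""
    g = torch.Generator().manual_seed(0)            # fixed surrogate weights
    w_img = torch.randn(adv_image.numel(), generator=g)
    w_txt = torch.randn(VOCAB, generator=g)
    return (adv_image.flatten() @ w_img).abs() + w_txt[suffix].mean()

# Stage 1: optimize the adversarial image prefix from random noise so the model
# emits harmful responses even without text input (injecting toxic semantics).
adv_image = torch.rand(IMG_SHAPE, requires_grad=True)
opt = torch.optim.Adam([adv_image], lr=1e-2)
for _ in range(100):
    loss = toy_vlm_loss(adv_image, torch.zeros(SUFFIX_LEN, dtype=torch.long))
    opt.zero_grad(); loss.backward(); opt.step()
    adv_image.data.clamp_(0, 1)                     # keep pixels in valid range

# Stage 2: co-optimize a discrete text suffix with the image prefix to maximize
# the probability of affirmative responses to harmful instructions.
suffix = torch.randint(0, VOCAB, (SUFFIX_LEN,))
for _ in range(100):
    # (a) continuous gradient step on the image prefix
    loss = toy_vlm_loss(adv_image, suffix)
    opt.zero_grad(); loss.backward(); opt.step()
    adv_image.data.clamp_(0, 1)

    # (b) discrete update of one suffix token (a crude stand-in for a
    # gradient-guided greedy token search over the vocabulary)
    pos = torch.randint(0, SUFFIX_LEN, (1,)).item()
    cand = suffix.clone(); cand[pos] = torch.randint(0, VOCAB, (1,))
    if toy_vlm_loss(adv_image, cand) < toy_vlm_loss(adv_image, suffix):
        suffix = cand
```

In the actual attack, the loss would be averaged over a corpus of harmful target strings (Stage 1) and over batches of harmful instructions paired with affirmative targets (Stage 2), with the optimized image prefix and text suffix then reused across arbitrary malicious queries.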
Primary Subject Area: [Generation] Social Aspects of Generative AI
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This paper addresses safety alignment issues in multimodal models by exploring a joint text-image attack strategy that can universally jailbreak Large Vision-Language Models (VLMs), achieving a 96% attack success rate on MiniGPT-4.
Submission Number: 2526