Keywords: Multimodal large language models, jailbreak attacks, puzzles, weapon assembly, chemical synthesis
Abstract: Despite the significant advances of Multimodal Large Language Models (MLLMs) on many vision-language understanding tasks, recent research has revealed that MLLMs are susceptible to jailbreak attacks, in which malicious attackers bypass the safety alignment of MLLMs by manipulating input data so that MLLMs generate harmful content. Previous jailbreak attacks on MLLMs mainly focus on low-risk scenarios with easily detectable malicious intent. In this paper, we target two high-risk real-world scenarios, weapon assembly and chemical synthesis, and introduce a novel vision-instructed puzzle jailbreak attack that stealthily embeds harmful intent within cross-modal puzzles. Specifically, we develop a unified pipeline comprising textual taxonomy generation, visual object decomposition, and vision-instructed puzzle construction. Following this pipeline, we introduce PuzzleV-JailBench, a novel benchmark covering 144 dangerous weapons and 54 hazardous chemicals. Using this benchmark, we empirically show that state-of-the-art open-source MLLMs (e.g., LLaVA-v1.6, Qwen2-VL, and DeepSeek-VL) and production MLLMs (e.g., GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet) can be induced to generate highly dangerous content in these two high-risk scenarios.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15830