Keywords: Multimodal large language models, jailbreak attacks, puzzles, weapon assembly, chemical synthesis
Abstract: Despite the significant advances of Multimodal Large Language Models (MLLMs) on many vision-language understanding tasks, recent research has revealed that MLLMs are susceptible to jailbreak attacks, in which malicious attackers bypass the safety alignment of MLLMs by manipulating input data so that MLLMs generate harmful content. Previous jailbreak attacks on MLLMs mainly focus on low-risk scenarios with easily detectable malicious intent. In this paper, we target two high-risk real-world scenarios, weapon assembly and chemical synthesis, and introduce a novel vision-instructed puzzle jailbreak attack that stealthily embeds harmful intent within cross-modal puzzles. Specifically, we develop a unified pipeline comprising textual taxonomy generation, visual object decomposition, and vision-instructed puzzle construction. Following this pipeline, we introduce PuzzleV-JailBench, a novel benchmark covering 144 dangerous weapons and 54 hazardous chemicals. Using this benchmark, we empirically show that state-of-the-art open-source MLLMs (e.g., LLaVA-v1.6, Qwen2-VL, and DeepSeek-VL) and production MLLMs (e.g., GPT-4o, Gemini-1.5-Pro, and Claude-3.5-Sonnet) can be induced to generate highly dangerous content in these two high-risk scenarios.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 15830