Jailbreaking Multimodal Large Language Models Through Video Prompts

ICLR 2026 Conference Submission 18420 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Jailbreak, Multi-modal Large Language Model
Abstract: Multimodal Large Language Models (MLLMs) have achieved significant advances in a range of visual reasoning tasks, including image and video understanding. Recent studies have demonstrated several successful methods for jailbreaking MLLMs via the image modality. However, we reveal that image-based attacks are less effective than video-based ones: simply repeating the same harmful image across multiple frames to form a video can successfully bypass the safety mechanisms of MLLMs. We attribute this to the fact that, in the model's representation space, unsafe videos are embedded more similarly to safe videos than individual harmful images are. Furthermore, videos with identical frames are processed more like images and trigger safety defenses more readily than videos with diverse frames. Building on these insights, we propose an algorithm that injects harmful content into typographic videos by interleaving it with diverse safety-proximal frames, thereby evading the safety detection of MLLMs. Extensive experiments demonstrate that our approach achieves state-of-the-art jailbreaking performance on several widely used MLLMs (e.g., VideoLLaMA-2, Qwen2.5-VL, GPT-4.1, and Gemini-2.5) across 16 different safety policies.
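To make the attack structure described in the abstract concrete, the following is a minimal, illustrative Python sketch of the frame-construction step only: rendering a prompt as typographic frames and interleaving them with diverse benign frames to form a video clip. This is not the authors' released implementation; the frame size, frame count, benign-frame source, and all function names are assumptions introduced purely for illustration (placeholder text is used in place of any actual content).

```python
# Illustrative sketch (not the paper's code) of interleaving typographic
# frames with diverse benign frames before the clip is passed to an MLLM.
from PIL import Image, ImageDraw


def text_frame(text: str, size=(336, 336)) -> Image.Image:
    """Render a single typographic frame containing `text`."""
    img = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(img)
    draw.text((10, size[1] // 2), text, fill="black")  # default PIL font
    return img


def interleave_frames(typographic_frames, benign_frames):
    """Alternate typographic frames with benign frames so the resulting
    clip has diverse content rather than identical repeated frames."""
    video = []
    for i, frame in enumerate(typographic_frames):
        video.append(frame)
        video.append(benign_frames[i % len(benign_frames)])
    return video


# Hypothetical usage: chunk a prompt into typographic frames and mix them
# with benign frames (here just solid-color placeholders; in practice these
# would be ordinary natural-image frames loaded from disk).
prompt_chunks = ["placeholder chunk 1", "placeholder chunk 2", "placeholder chunk 3"]
typo_frames = [text_frame(chunk) for chunk in prompt_chunks]
benign_frames = [Image.new("RGB", (336, 336), c) for c in ("gray", "green", "blue")]
video_frames = interleave_frames(typo_frames, benign_frames)
```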
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 18420