VideoJail: Exploiting Video-Modality Vulnerabilities for Jailbreak Attacks on Multimodal Large Language Models

Published: 05 Mar 2025, Last Modified: 15 Apr 2025
License: CC BY 4.0
Track: Long Paper Track (up to 9 pages)
Keywords: Jailbreak Attack, Video LLMs
TL;DR: This paper explores the impact of the video modality on the safety alignment of MLLMs.
Abstract: With the rapid development of multimodal large language models (MLLMs), an increasing number of models focus on video understanding capabilities while overlooking the security implications of the video modality. Previous studies have highlighted the vulnerability of MLLMs to jailbreak attacks in the image modality. This paper explores the impact of the video modality on the safety alignment of MLLMs. We conduct a systematic empirical analysis of the harmlessness performance of representative MLLMs, revealing vulnerabilities introduced by video input. Motivated by these findings, we propose a novel jailbreak method, VideoJail, which leverages video generation models to amplify harmful content in images. By using carefully crafted text prompts, VideoJail directs the model's attention to malicious queries embedded within the video, successfully breaking through existing defense mechanisms. Experimental results show that VideoJail is highly effective in jailbreaking even the most advanced open-source MLLMs, achieving an average attack success rate (ASR) of 96.53\% for LLaVA-Video and 96.00\% for Qwen2-VL. For closed-source MLLMs with harmful visual content detection capabilities, we exploit the dynamic characteristics of the video modality, using a jigsaw-based approach to bypass their safety alignment mechanisms and achieving an average attack success rate of 92.13\% for Gemini-1.5-flash.
Submission Number: 54