Jailbreak Large Vision-Language Models Through Multi-Modal Linkage

ACL ARR 2024 December Submission 436 Authors

13 Dec 2024 (modified: 05 Feb 2025) · ACL ARR 2024 December Submission · CC BY 4.0
Abstract: With the significant advancement of Large Vision-Language Models (VLMs), concerns about their potential misuse and abuse have grown rapidly. Previous studies have highlighted VLMs' vulnerability to jailbreak attacks, where carefully crafted inputs can lead the model to produce content that violates ethical and legal standards. However, existing jailbreak methods struggle against state-of-the-art VLMs such as GPT-4o, due to the over-exposure of harmful content and the lack of stealthy malicious guidance. In this work, we propose a novel jailbreak attack framework: the Multi-Modal Linkage (MML) Attack. Drawing inspiration from cryptography, MML uses an encryption-decryption process across the text and image modalities to mitigate over-exposure of malicious information. To covertly align the model's output with malicious intent, MML employs a technique called "evil alignment", which frames the attack within a video game production scenario. Comprehensive experiments demonstrate MML's effectiveness: it jailbreaks GPT-4o with attack success rates of 97.80% on SafeBench, 98.81% on MM-SafeBench, and 99.07% on HADES-Dataset. We will open-source the code and data in the public version of this manuscript.
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: security and privacy
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 436