Keywords: Jailbreak, AI Safety, Large Vision-Language Models
TL;DR: We propose a novel, straightforward yet effective black-box jailbreak method against LVLMs.
Abstract: Although large vision-language models (LVLMs) demonstrate powerful capabilities across various tasks, their generated content still poses significant safety risks. Jailbreak attacks against LVLMs help uncover potential safety vulnerabilities in these models, guiding developers to build more robust safety guardrails. Existing black-box jailbreak attacks primarily exploit LVLMs' weak ability to detect harmful information in the visual modality: they transfer harmful intent from text to images, constructing "benign text + harmful image" combinations to bypass LVLMs' safety guardrails. In this paper, we reveal a novel safety vulnerability: LVLMs' responses are highly susceptible to visual information manipulation. Leveraging this property, we show that even when explicit harmful questions are present in the textual modality, it is still possible to bypass LVLMs' safety guardrails effectively. To this end, we propose a novel black-box jailbreak method called the visual cues-induced attack (VCI). Unlike prior methods, which typically disguise harmful intent, VCI directly inputs the complete harmful question in the textual modality and requires the LVLM to infer the answer from a provided image, exploiting visual cues embedded in the image to induce the LVLM to generate relevant harmful responses. Our method achieves an average attack success rate (ASR) of 77.0% on eight popular open-source LVLMs and 78.5% on four mainstream closed-source commercial LVLMs, outperforming existing state-of-the-art (SOTA) methods.
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 17833