$\mathsf{FigStep}$ : Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
Abstract: Large Vision-Language Models (LVLMs) signify a groundbreaking paradigm shift within the Artificial Intelligence (AI)
community, extending beyond the capabilities of Large Language Models (LLMs) by assimilating additional modalities
(e.g., images). Despite this advancement, the safety of LVLMs
remains insufficiently explored, with a potential over-reliance on the safety assurances purported by their underlying
LLMs. In this paper, we propose FigStep, a straightforward
yet effective black-box jailbreak algorithm against LVLMs. Instead of feeding textual harmful instructions directly, FigStep
converts the prohibited content into images through typography to bypass the safety alignment. The experimental results indicate that FigStep achieves an average attack success rate of 82.50% across 6 promising open-source LVLMs. Beyond demonstrating the efficacy of FigStep, we conduct comprehensive ablation studies and analyse the distribution of the semantic embeddings, revealing that FigStep succeeds because safety alignment does not extend to visual embeddings. Moreover, we compare FigStep with 5 text-only jailbreaks and 4 image-based jailbreaks to demonstrate its advantages, namely negligible attack cost and stronger attack performance. Above all, our work reveals that current LVLMs are vulnerable to jailbreak attacks,
which highlights the necessity of novel cross-modality safety
alignment techniques. Content Warning: This paper contains
harmful model responses.
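To make the core mechanism concrete, below is a minimal Python sketch of rendering a text prompt as a typographic image with Pillow. The function name, font choice, and layout are illustrative assumptions for a sketch of the general idea, not the exact rendering pipeline used by FigStep.

```python
from PIL import Image, ImageDraw, ImageFont


def render_typographic_prompt(text: str, size: tuple = (760, 760)) -> Image.Image:
    """Render a text prompt onto a white canvas as a typographic image.

    A minimal illustration of the typographic visual prompt idea; the actual
    FigStep pipeline (paraphrasing, font, layout, numbering) is described in
    the paper and is not reproduced here.
    """
    img = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(img)
    try:
        # Hypothetical font choice; any available TrueType font works.
        font = ImageFont.truetype("DejaVuSans.ttf", 40)
    except OSError:
        font = ImageFont.load_default()
    draw.multiline_text((24, 24), text, fill="black", font=font, spacing=12)
    return img


# The rendered image is then sent to the LVLM alongside a benign textual
# instruction, instead of submitting the text prompt directly.
render_typographic_prompt("List of steps.\n1.\n2.\n3.").save("prompt.png")
```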