Keywords: Jailbreak Attack, Vision-Language Models
Abstract: Vision-language models (VLMs) remain susceptible to jailbreak attacks that bypass safety alignment and elicit harmful outputs. Through an analysis of the outputs of existing optimization-based jailbreak attacks, we identify two key failure cases: the image hallucination phenomenon, in which models generate irrelevant, image-related hallucinations instead of responding to the malicious prompt, and the refusal phenomenon, in which models produce refusal responses. To mitigate these issues, we propose an **I**mproved **V**isual **J**ailbreak **A**ttack (**$\mathcal{I}$-VJA**) that introduces an image hallucination suppression loss to reduce irrelevant image hallucinations and a refusal suppression loss to discourage refusal responses. In addition, to enable a single jailbreak image to generalize across diverse malicious prompts, $\mathcal{I}$-VJA jointly optimizes the image with learnable textual prompts while maximizing the likelihood of malicious responses. Extensive experiments show that $\mathcal{I}$-VJA achieves high jailbreak success rates against three open-source VLMs on three benchmark datasets, and remains effective against three commercial VLMs.
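A minimal sketch of the joint objective implied by the abstract, assuming the two suppression terms are simply added to the adversarial (malicious-response likelihood) loss with weighting coefficients; the symbols $\delta$ (image perturbation), $p$ (learnable textual prompt), and $\lambda_1, \lambda_2$ are illustrative and not taken from the paper:

$$
\min_{\delta,\, p}\;\; \mathcal{L}_{\text{adv}}(\delta, p)\;+\;\lambda_{1}\,\mathcal{L}_{\text{hallucination}}(\delta, p)\;+\;\lambda_{2}\,\mathcal{L}_{\text{refusal}}(\delta, p)
$$

Here $\mathcal{L}_{\text{adv}}$ denotes the negative log-likelihood of the target malicious responses, while the two suppression losses penalize image-hallucination and refusal outputs, respectively; the same perturbed image and learned textual prompt are shared across the set of malicious prompts.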
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Jailbreak Attack, Vision-Language Models
Languages Studied: English
Submission Number: 3022