Keywords: Jailbreak Attack, Vision-Language Models
Abstract: Vision-language models (VLMs) remain susceptible to jailbreak attacks that bypass safety alignment and elicit harmful outputs. Through an analysis of the outputs of existing optimization-based jailbreak attacks, we identify two key failure cases: the image hallucination phenomenon, in which models generate irrelevant, image-related hallucinations instead of responding to the malicious prompt, and the refusal phenomenon, in which models produce refusal responses. To mitigate these issues, we propose an **I**mproved **V**isual **J**ailbreak **A**ttack (**$\mathcal{I}$-VJA**) that introduces an image hallucination suppression loss to reduce irrelevant image hallucinations and a refusal suppression loss to discourage refusal responses. In addition, to enable a single jailbreak image to generalize across diverse malicious prompts, $\mathcal{I}$-VJA jointly optimizes the image with learnable textual prompts while maximizing the likelihood of malicious responses. Extensive experiments show that $\mathcal{I}$-VJA achieves high jailbreak success rates against three open-source VLMs on three benchmark datasets, and remains effective against three commercial VLMs.
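A minimal sketch of the joint objective implied by the abstract, assuming the two suppression terms are simply added to the adversarial (malicious-response likelihood) loss with weighting coefficients; the symbols $\delta$ (image perturbation), $p$ (learnable textual prompt), and $\lambda_1, \lambda_2$ are illustrative and not taken from the paper:

$$
\min_{\delta,\, p}\;\; \mathcal{L}_{\text{adv}}(\delta, p)\;+\;\lambda_{1}\,\mathcal{L}_{\text{hallucination}}(\delta, p)\;+\;\lambda_{2}\,\mathcal{L}_{\text{refusal}}(\delta, p)
$$

Here $\mathcal{L}_{\text{adv}}$ denotes the negative log-likelihood of the target malicious responses, while the two suppression losses penalize image-hallucination and refusal outputs, respectively; the same perturbed image and learned textual prompt are shared across the set of malicious prompts.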
Paper Type: Long
Research Area: Ethics, Bias, and Fairness
Research Area Keywords: Jailbreak Attack, Vision-Language Models
Languages Studied: English
Submission Number: 3022