Why does Weak-OOD Help? A Further Step Towards Understanding Jailbreaking VLMs

ICLR 2026 Conference Submission 19986 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Jailbreak Attack, VLM
TL;DR: VLMs are vulnerable to OOD-based jailbreaks. This study finds that "weak-OOD" inputs boost jailbreak performance, attributes this to a trade-off between input intent perception and model refusal triggering rooted in pre-training/alignment gaps, and designs a new method that outperforms baselines.
Abstract: Large Vision-Language Models (VLMs) are susceptible to jailbreak attacks: researchers have developed a variety of attack strategies that can successfully bypass the safety mechanisms of VLMs. Among these approaches, jailbreak methods based on the Out-of-Distribution (OOD) strategy have garnered widespread attention due to their simplicity and effectiveness. This paper further advances the in-depth understanding of OOD-based VLM jailbreak methods. Experimental results demonstrate that jailbreak samples generated via mild OOD strategies exhibit superior performance in circumventing the safety constraints of VLMs—a phenomenon we define as "weak-OOD". To unravel the underlying causes of this phenomenon, this study takes SI-Attack, a typical OOD-based jailbreak method, as the research object. We attribute this phenomenon to a trade-off between two dominant factors: input intent perception and model refusal triggering. The inconsistency in how these two factors respond to OOD manipulations gives rise to this phenomenon. Furthermore, we provide a theoretical argument for the inevitability of such inconsistency from the perspective of discrepancies between model pre-training and alignment processes. Building on the above insights, we draw inspiration from optical character recognition (OCR) capability enhancement—a core task in the pre-training phase of mainstream VLMs. Leveraging this capability, we design a simple yet highly effective VLM jailbreak method whose performance outperforms that of SOTA baselines. Code is available on GitHub.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19986