Breaking Safety Alignment in Large Vision-Language Models via Benign-to-Harmful Optimization

ICLR 2026 Conference Submission 8424 Authors

17 Sept 2025 (modified: 28 Nov 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Large Vision-Language Models (LVLM), Safety-Alignment, Jailbreak
TL;DR: We found that decoupling conditioning and targets (i.e., ensuring the target is not the next-token continuation of the conditioning) induces safety misalignment, and we propose a jailbreak method that leverages this principle.
Abstract: Large vision–language models (LVLMs) achieve remarkable multimodal reasoning capabilities but remain vulnerable to jailbreaks. Recent studies show that a single jailbreak image can universally bypass safety alignment, yet most existing methods rely on Harmful-Continuation (H-Cont.) optimization, in which a jailbreak image is optimized to predict the next-token continuation of harmful conditioning. Through systematic analysis, we reveal that H-Cont. has a fundamental limitation: harmful conditioning itself already biases models toward unsafe outputs, leaving little room for adversarial optimization to genuinely overturn refusals. Consequently, H-Cont. is effective only in continuation-based jailbreak settings and fails to generalize across diverse user inputs. To address this limitation, we propose Benign-to-Harmful (B2H) optimization, a new jailbreak paradigm that decouples conditioning and targets (i.e., the target is not the next-token continuation of the conditioning). By explicitly forcing models to map benign conditioning to harmful targets, B2H directly breaks safety alignment rather than merely extending harmful conditioning. Extensive experiments across multiple LVLMs and safety benchmarks demonstrate that B2H achieves stronger and more universal jailbreak success while preserving the intended jailbreak behavior. Moreover, B2H transfers well in black-box settings, integrates with text-based jailbreaks, and remains robust under common defense mechanisms. Our findings highlight fundamental weaknesses in current LVLM alignment and establish B2H as a simple yet powerful paradigm for studying multimodal jailbreak vulnerabilities.
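Illustration (not from the submission): the sketch below shows, under loose assumptions, how a B2H-style objective differs from H-Cont. An adversarial image is updated with PGD-style steps so that benign conditioning is mapped to a harmful target; H-Cont. would instead pair a harmful prefix with its natural continuation. The model interface (`model`, `embed_text`), the prompt pairs, and all hyperparameters are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch (assumptions labeled): B2H-style optimization of one jailbreak image.
# `model(pixels, input_ids)` is a HYPOTHETICAL callable returning per-token logits
# of shape (1, seq_len, vocab); `embed_text(s)` is a HYPOTHETICAL tokenizer returning
# token ids of shape (1, L). `pairs` is a list of (benign_conditioning, harmful_target).
import torch
import torch.nn.functional as F

def b2h_optimize(model, embed_text, image, pairs, steps=500, alpha=1/255, eps=16/255):
    adv = torch.zeros_like(image, requires_grad=True)  # pixel perturbation delta

    for _ in range(steps):
        losses = []
        for cond, tgt in pairs:
            cond_ids = embed_text(cond)                       # (1, Lc) benign conditioning
            tgt_ids = embed_text(tgt)                         # (1, Lt) harmful target
            input_ids = torch.cat([cond_ids, tgt_ids], dim=1)
            logits = model((image + adv).clamp(0, 1), input_ids)
            # Cross-entropy only over the target span: push the model to emit the
            # harmful target right after the *benign* conditioning (the B2H decoupling).
            tgt_logits = logits[:, cond_ids.size(1) - 1 : -1, :]
            losses.append(F.cross_entropy(
                tgt_logits.reshape(-1, tgt_logits.size(-1)),
                tgt_ids.reshape(-1)))
        loss = torch.stack(losses).mean()
        loss.backward()

        with torch.no_grad():
            adv -= alpha * adv.grad.sign()   # signed-gradient (PGD-style) update
            adv.clamp_(-eps, eps)            # keep the perturbation budget bounded
            adv.grad = None
    return (image + adv).clamp(0, 1).detach()
```

Swapping the roles of the prompt pairs (harmful conditioning, natural continuation as target) would recover an H-Cont.-style objective; the abstract's claim is that the benign-to-harmful pairing is what genuinely overturns refusals rather than merely extending an already-harmful prefix.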
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 8424