Abstract: Vision-Language Models (VLMs) have achieved remarkable performance across various tasks. However, their multimodal nature exposes them to a common jailbreak strategy: transforming harmful instructions into visual formats, such as stylized typography or AI-generated images, to bypass safety alignment. Despite numerous heuristic defenses, little research has investigated the underlying rationale behind these jailbreaks. In this paper, we introduce an information-theoretic framework that explores the fundamental trade-off between attack effectiveness and stealthiness. Leveraging Fano's inequality, we show that an attacker's success probability is intrinsically tied to the stealthiness of the generated prompts. We further propose an efficient algorithm for detecting non-stealthy jailbreak attacks. Experimental results highlight the inherent tension between strong attacks and detectability, yielding a formal lower bound on adversarial strategies and informing potential defense mechanisms.
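For reference, the standard statement of Fano's inequality that underlies bounds of this kind (the notation here is generic and not taken from the paper): for any estimator \(\hat{X}\) of a discrete variable \(X\) drawn from a finite set \(\mathcal{X}\), computed from an observation \(Y\), with error probability \(P_e = \Pr[\hat{X} \ne X]\),

\[
H(X \mid Y) \le H(P_e) + P_e \log\bigl(|\mathcal{X}| - 1\bigr),
\]

which rearranges into a lower bound on \(P_e\) in terms of the conditional entropy \(H(X \mid Y)\). A bound of this form is the natural route to relating an attacker's (or detector's) success probability to how much information the prompt leaks, i.e., its stealthiness.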
Supplementary Material: pdf
Submission Number: 189