Abstract: Vision-Language Models (VLMs) have achieved remarkable performance across various tasks. However, their multimodal nature exposes them to a common jailbreak strategy: transforming harmful instructions into visual formats, such as stylized typography or AI-generated images, to bypass safety alignment. Despite numerous heuristic defenses, little research has investigated the underlying rationale behind these jailbreaks. In this paper, we introduce an information-theoretic framework that explores the fundamental trade-off between attack effectiveness and stealthiness. Leveraging Fano's inequality, we show that an attacker's success probability is intrinsically tied to the stealthiness of the generated prompts. We further propose an efficient algorithm for detecting non-stealthy jailbreak attacks. Experimental results highlight the inherent tension between strong attacks and detectability, yielding a formal lower bound on adversarial strategies and informing potential defense mechanisms.
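For reference, the standard statement of Fano's inequality that underlies bounds of this kind (the notation here is generic and not taken from the paper): for any estimator \(\hat{X}\) of a discrete variable \(X\) drawn from a finite set \(\mathcal{X}\), computed from an observation \(Y\), with error probability \(P_e = \Pr[\hat{X} \ne X]\),

\[
H(X \mid Y) \le H(P_e) + P_e \log\bigl(|\mathcal{X}| - 1\bigr),
\]

which rearranges into a lower bound on \(P_e\) in terms of the conditional entropy \(H(X \mid Y)\). A bound of this form is the natural route to relating an attacker's (or detector's) success probability to how much information the prompt leaks, i.e., its stealthiness.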
Supplementary Material: pdf
Submission Number: 189