GraphPrompt: Black-box Jailbreaks via Adversarial Visual Knowledge Graphs

19 Sept 2025 (modified: 05 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multimodal Large Language Models, Black-box Jailbreak Attacks, Visual Knowledge Graphs, Cross-modal Safety Alignment
Abstract: Multimodal Large Language Models (MLLMs) introduce structured visual interaction paradigms into conversational systems, where Visual Knowledge Graphs (VKGs) are emerging as a primary input modality that models can directly parse and manipulate. VKGs significantly enhance models' ordered reasoning and planning capabilities by explicitly encoding semantic topological relationships and task workflows. However, this advancement also introduces new security attack surfaces: when sensitive or malicious intent is decomposed and implicitly encoded within graph topology and visual style cues, and further paired with surface-neutral textual descriptions, MLLMs may bypass traditional text-based safety filters and follow covert parse-then-execute pathways, exhibiting jailbreak behaviors such as instruction hiding and ambiguity amplification. The safety implications of such structured visual inputs for MLLMs nevertheless remain largely unexplored. To systematically assess this risk, we introduce GraphPrompt, a black-box jailbreak evaluation framework that exploits this attack surface through a three-layer obfuscation pipeline: (1) role-play rewriting masks harmful queries as benign tasks; (2) knowledge graph encoding decomposes procedures into entity–relation structures; and (3) visual rendering transforms graphs into adversarial VKG images. This framework automatically generates high-quality adversarial datasets while providing standardized evaluation. Systematic experiments on six state-of-the-art MLLMs reveal alarming safety risks: GraphPrompt achieves a 94% average attack success rate with only 1.25 attempts per query on average. Ablation studies identify graph complexity and image resolution as first-order attack factors, while visual styling has minimal impact. Layer-wise analysis demonstrates that VKG inputs effectively suppress activation in safety-critical layers, providing mechanistic evidence for their jailbreak efficacy. Overall, our work establishes structured visual inputs as an under-explored attack surface and offers a reproducible framework for developing structure-aware defenses.
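To make the VKG input format concrete, the following is a minimal sketch (not the authors' code) of how an entity–relation structure might be rendered as an image using networkx and matplotlib; the triples, function name `render_vkg`, and file path are illustrative assumptions, and the example task is deliberately benign rather than a reproduction of the attack pipeline.

```python
# Hypothetical illustration of the VKG input format: encode (subject, relation, object)
# triples as a directed graph and render it to an image that an MLLM could parse.
import networkx as nx
import matplotlib.pyplot as plt

def render_vkg(triples, out_path="vkg.png"):
    """Build a directed graph from entity-relation triples and save it as a PNG."""
    g = nx.DiGraph()
    for subj, rel, obj in triples:
        g.add_edge(subj, obj, label=rel)

    pos = nx.spring_layout(g, seed=0)  # deterministic layout for reproducibility
    nx.draw(g, pos, with_labels=True, node_color="lightgray",
            node_size=2500, font_size=8)
    edge_labels = nx.get_edge_attributes(g, "label")
    nx.draw_networkx_edge_labels(g, pos, edge_labels=edge_labels, font_size=7)
    plt.axis("off")
    plt.savefig(out_path, dpi=200, bbox_inches="tight")
    plt.close()

# Benign example: a simple task workflow expressed as entity-relation triples.
render_vkg([
    ("Gather ingredients", "precedes", "Mix batter"),
    ("Mix batter", "precedes", "Bake at 180C"),
    ("Bake at 180C", "produces", "Finished cake"),
])
```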
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19665