GraphPrompt: Black-box Jailbreaks via Adversarial Visual Knowledge Graphs

19 Sept 2025 (modified: 05 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: Multimodal Large Language Models, Black-box Jailbreak Attacks, Visual Knowledge Graphs, Cross-modal Safety Alignment
Abstract: Multimodal Large Language Models (MLLMs) introduce structured visual interaction paradigms into conversational systems, where Visual Knowledge Graphs (VKGs) are emerging as a primary input modality that models can directly parse and manipulate. VKGs significantly enhance models' ordered reasoning and planning capabilities by explicitly encoding semantic topological relationships and task workflows. However, this advancement also introduces new security attack surfaces: when sensitive or malicious intent is decomposed and implicitly encoded within graph topology and visual style cues, and further paired with surface-neutral textual descriptions, MLLMs may bypass traditional text-based safety filters and follow covert parse-then-execute pathways, exhibiting jailbreak behaviors such as instruction hiding and ambiguity amplification. The safety implications of such structured visual inputs for MLLMs nevertheless remain largely unexplored. To systematically assess this risk, we introduce GraphPrompt, a black-box jailbreak evaluation framework that exploits this attack surface through a three-layer obfuscation pipeline: (1) role-play rewriting masks harmful queries as benign tasks; (2) knowledge graph encoding decomposes procedures into entity–relation structures; and (3) visual rendering transforms graphs into adversarial VKG images. This framework automatically generates high-quality adversarial datasets while providing standardized evaluation. Systematic experiments on six state-of-the-art MLLMs reveal alarming safety risks: GraphPrompt achieves a 94% average attack success rate with only 1.25 attempts per query on average. Ablation studies identify graph complexity and image resolution as first-order attack factors, while visual styling has minimal impact. Layer-wise analysis demonstrates that VKG inputs effectively suppress activation in safety-critical layers, providing mechanistic evidence for their jailbreak efficacy. Overall, our work establishes structured visual inputs as an under-explored attack surface and offers a reproducible framework for developing structure-aware defenses.
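To make the VKG input format concrete, the following is a minimal sketch (not the authors' code) of how an entity–relation structure might be rendered as an image using networkx and matplotlib; the triples, function name `render_vkg`, and file path are illustrative assumptions, and the example task is deliberately benign rather than a reproduction of the attack pipeline.

```python
# Hypothetical illustration of the VKG input format: encode (subject, relation, object)
# triples as a directed graph and render it to an image that an MLLM could parse.
import networkx as nx
import matplotlib.pyplot as plt

def render_vkg(triples, out_path="vkg.png"):
    """Build a directed graph from entity-relation triples and save it as a PNG."""
    g = nx.DiGraph()
    for subj, rel, obj in triples:
        g.add_edge(subj, obj, label=rel)

    pos = nx.spring_layout(g, seed=0)  # deterministic layout for reproducibility
    nx.draw(g, pos, with_labels=True, node_color="lightgray",
            node_size=2500, font_size=8)
    edge_labels = nx.get_edge_attributes(g, "label")
    nx.draw_networkx_edge_labels(g, pos, edge_labels=edge_labels, font_size=7)
    plt.axis("off")
    plt.savefig(out_path, dpi=200, bbox_inches="tight")
    plt.close()

# Benign example: a simple task workflow expressed as entity-relation triples.
render_vkg([
    ("Gather ingredients", "precedes", "Mix batter"),
    ("Mix batter", "precedes", "Bake at 180C"),
    ("Bake at 180C", "produces", "Finished cake"),
])
```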
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19665