Keywords: Multimodal Large Language Models, Black-box Jailbreak Attacks, Visual Knowledge Graphs, Cross-modal Safety Alignment
Abstract: Multimodal Large Language Models (MLLMs) introduce structured visual interaction paradigms into conversational systems, where Visual Knowledge Graphs (VKGs) are emerging as a primary input modality that MLLMs can directly parse and manipulate. By explicitly expressing semantic topological relationships and task workflows, VKGs significantly enhance models' ordered reasoning and planning capabilities. However, this advancement also introduces a new attack surface: when sensitive or malicious intent is decomposed and implicitly encoded within the topological features and visual style cues of the graph structure, and paired with surface-neutral textual descriptions, MLLMs may bypass traditional text-based safety filters and trigger covert parsing-execution pathways, enabling jailbreaking behaviors such as instruction hiding and ambiguity amplification. The core motivation of this paper is to reveal a critical and largely unexplored contradiction: while structured visual inputs enhance model reasoning capabilities and intent accessibility, the visual semantic ambiguity and interpretive uncertainty introduced by graphical encoding simultaneously undermine the effectiveness of existing safety detection mechanisms and the robustness of model alignment. To investigate this issue, we propose GraphPrompt, a novel jailbreaking paradigm designed specifically for VKGs, and develop a standardized evaluation protocol. Notably, the framework can automatically construct high-quality adversarial sample datasets, thereby also serving as a data generation pipeline. Based on this framework, we conduct systematic VKG-driven jailbreak experiments on multiple mainstream MLLMs. The results reveal widespread vulnerabilities in current models to structured visual inputs, with consistently high jailbreak success rates. Further attribution analysis and ablation experiments identify key factors influencing attack effectiveness, including graph scale (number of nodes and edges) and visual encoding strategies (e.g., color schemes, resolution).
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19665