Keywords: Multimodal models; LLM; LLM Security
Abstract: Multimodal large language models (MLLMs) generate text by conditioning on heterogeneous inputs such as images and text. We present allusive adversarial examples, a new class of attacks that imperceptibly encode target instructions into non-textual modalities. Unlike prior adversarial examples, these attacks manipulate model outputs without altering the textual instruction. To construct them, we introduce a practical learning framework that leverages cross-modal alignment and exploits the shared latent space of MLLMs. Empirical evaluation on LLaVA, InternVL, Qwen-VL, and Gemma demonstrates that our method produces efficient and effective adversarial examples, uncovering a critical security risk in multimodal systems.
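For intuition only, the abstract's idea of imperceptibly encoding a target instruction into an image can be illustrated with a generic PGD-style perturbation sketch. This is not the paper's actual framework; `target_loss` below is a hypothetical stand-in for the MLLM's loss on the target instruction tokens, conditioned on the perturbed image and an unmodified text prompt.

```python
# Illustrative sketch (assumption): L-infinity-bounded image perturbation that
# steers a multimodal model toward an attacker-chosen target output.
import torch

def target_loss(image: torch.Tensor) -> torch.Tensor:
    # Placeholder for the MLLM's cross-entropy on the target instruction,
    # conditioned on (image, benign text prompt). Toy proxy used here so the
    # sketch runs standalone.
    return ((image - 0.5) ** 2).mean()

def pgd_attack(image: torch.Tensor, eps: float = 8 / 255,
               step: float = 1 / 255, iters: int = 100) -> torch.Tensor:
    """Keep the perturbation within an L-infinity budget so it stays imperceptible."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(iters):
        loss = target_loss((image + delta).clamp(0, 1))
        loss.backward()
        with torch.no_grad():
            delta -= step * delta.grad.sign()  # descend toward the target output
            delta.clamp_(-eps, eps)            # enforce the imperceptibility budget
            delta.grad.zero_()
    return (image + delta.detach()).clamp(0, 1)

adv_image = pgd_attack(torch.rand(3, 224, 224))
```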
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 9994