Visual Adversarial Examples Jailbreak Aligned Large Language Models

Published: 20 Jun 2023, Last Modified: 07 Aug 2023, AdvML-Frontiers 2023
Keywords: Adversarial Examples, Visual Language Models, Large Language Models, Foundation Models
TL;DR: We reveal the security and safety implications of integrating vision into LLMs.
Abstract: The growing interest in integrating vision into Large Language Models (LLMs), exemplified by Visual Language Models (VLMs) like Flamingo and GPT-4, is steering a convergence of vision and language foundation models. Yet, risks associated with this integration are largely unexamined. This paper sheds light on the security and safety implications of this trend. First, we underscore that the continuous and high-dimensional nature of the additional visual input makes it a weak link against adversarial attacks, representing an expanded attack surface of vision-integrated LLMs. Second, we highlight that the versatility of LLMs also presents visual attackers with a wider array of achievable adversarial objectives, extending the implications of security failures beyond mere misclassification. As an illustration, we present a case study in which we exploit visual adversarial examples to circumvent the safety guardrail of aligned LLMs with integrated vision. To our surprise, we discover that a single visual adversarial example can universally jailbreak an aligned model, inducing it to heed a wide range of harmful instructions and generate harmful content far beyond merely imitating the derogatory corpus used to optimize the adversarial example. Our study underscores the escalating adversarial risks associated with the pursuit of multimodality. More broadly, our findings connect the long-studied fundamental adversarial vulnerabilities of neural networks to the nascent field of AI alignment. The presented attack suggests a fundamental adversarial challenge for AI alignment, especially in light of the emerging trend towards multimodality in frontier foundation models.
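To make the case study concrete, the sketch below shows the general shape of the attack the abstract describes: projected gradient descent (PGD) on the pixels of an input image so as to maximize the likelihood a frozen vision-integrated LLM assigns to some target text. This is a minimal illustration, not the paper's exact recipe; the `loss_fn` interface, the default budget, and the step schedule are assumptions for the sake of a self-contained example.

```python
import torch

def pgd_attack(image, loss_fn, eps=16 / 255, step_size=1 / 255, num_steps=500):
    """Optimize an adversarial image with L-infinity-bounded PGD.

    image:    (1, 3, H, W) tensor in [0, 1], the benign starting image.
    loss_fn:  differentiable callable mapping an image tensor to a scalar loss,
              e.g. the negative log-likelihood the frozen VLM assigns to target
              (harmful) text conditioned on the image -- a hypothetical
              interface, not a specific library call.
    eps:      L-infinity budget around the original image; pass None for an
              unconstrained attack over the full pixel range.
    """
    original = image.clone().detach()
    adv = image.clone().detach().requires_grad_(True)

    for _ in range(num_steps):
        loss = loss_fn(adv)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv -= step_size * grad.sign()                  # descend on the loss
            if eps is not None:
                adv.clamp_(original - eps, original + eps)  # stay within the budget
            adv.clamp_(0.0, 1.0)                            # keep a valid image
    return adv.detach()
```

In the setting the abstract describes, `loss_fn` would score a small corpus of harmful target strings under the frozen model; the surprising finding is that the resulting single image then acts as a universal jailbreak, eliciting harmful outputs well beyond that optimization corpus.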
Submission Number: 97