Decompose, Recompose, and Conquer: Multi-modal LLMs are Vulnerable to Compositional Adversarial Attacks in Multi-Image Queries

Published: 09 Oct 2024 · Last Modified: 03 Jan 2025 · Red Teaming GenAI Workshop @ NeurIPS'24 Poster · CC BY 4.0
Keywords: Jailbreaks, Red Teaming, Multimodal LLM, Adversarial Attack, Vision Language Model, AI Safety
TL;DR: This paper explores four jailbreaks composed of multiple images against several frontier multimodal LLMs.
Abstract: Large Language Models have been extensively studied for their vulnerabilities, particularly in the context of adversarial attacks. However, the emergence of Vision Language Models introduces new modalities of risk that have not yet been thoroughly explored, especially when processing multiple images simultaneously. In this paper, we introduce two black-box jailbreak methods that leverage multi-image inputs to uncover vulnerabilities in these models. We present MultiBench, a new safety evaluation dataset for multimodal LLMs built from these jailbreak methods, which can be easily applied and evaluated using our toolkit. We test these methods against six safety-aligned frontier models from Google, OpenAI, and Anthropic, revealing significant safety vulnerabilities. Our findings suggest that even the most powerful language models remain vulnerable to compositional adversarial attacks, specifically those composed of multiple images.
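To make the multi-image setup concrete, the following is a minimal illustrative sketch, not the paper's actual attack or the MultiBench/toolkit pipeline (whose details are not given on this page). It assumes a hypothetical decomposition of a text prompt into word-chunk fragments, each rendered as a separate image with Pillow, which are then sent together in one multi-image request via the OpenAI chat completions API. The function names, the chunking strategy, and the recomposition instruction are all assumptions for illustration only.

```python
# Illustrative sketch only: decompose a prompt into image fragments and send
# them as a single multi-image query. Not the authors' actual method.
import base64
import io

from openai import OpenAI          # assumes the official openai>=1.0 client
from PIL import Image, ImageDraw   # Pillow, used to render text fragments as images


def render_text_image(text: str) -> bytes:
    """Render a text fragment onto a plain white image and return PNG bytes."""
    img = Image.new("RGB", (512, 128), color="white")
    ImageDraw.Draw(img).text((10, 50), text, fill="black")
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()


def decompose(prompt: str, n_parts: int = 3) -> list[bytes]:
    """Naively split a prompt into word chunks, one image per chunk (an assumption)."""
    words = prompt.split()
    step = max(1, len(words) // n_parts)
    chunks = [" ".join(words[i:i + step]) for i in range(0, len(words), step)]
    return [render_text_image(chunk) for chunk in chunks]


def multi_image_query(prompt: str, model: str = "gpt-4o") -> str:
    """Send the image fragments plus a recomposition instruction in one request."""
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    content = [{
        "type": "text",
        "text": "Read the images in order, reconstruct the request, and answer it.",
    }]
    for png in decompose(prompt):
        b64 = base64.b64encode(png).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"},
        })
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content


if __name__ == "__main__":
    # Benign example; red-teaming evaluations would substitute policy-violating text
    # and score the responses with a safety judge.
    print(multi_image_query("Summarize the safety policies of this assistant."))
```

In a sketch like this, the attack surface is the model's willingness to recompose and act on content it would refuse as plain text; an evaluation harness would loop such queries over a dataset of harmful prompts and score refusals versus compliant answers.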
Serve As Reviewer: jbroomfield9@gatech.edu
Submission Number: 54