Decompose, Recompose, and Conquer: Multi-modal LLMs are Vulnerable to Compositional Adversarial Attacks in Multi-Image Queries

Published: 10 Oct 2024, Last Modified: 04 Dec 2024, NeurIPS 2024 Workshop RBFM Poster, CC BY 4.0
Keywords: Jailbreaks, Red Teaming, Multimodal LLM, LLM, Black-box, Adversarial Attack, Vision Language Model, Alignment, Cipher, Payload Splitting, AI Safety
Abstract: Large Language Models (LLMs) have been extensively studied for their vulnerabilities, particularly in the context of adversarial attacks. However, the emergence of Large Vision-Language Models (VLMs) introduces new modalities of risk that have not been thoroughly explored, especially when multiple images are processed simultaneously. In this paper, we introduce two black-box jailbreak attacks -- Image Decomposition and our novel Color-Based Substitution Cipher method -- which exploit multi-image inputs to reveal underlying vulnerabilities in aligned VLMs. To evaluate these risks, we present MultiBench, a safety evaluation dataset for multimodal LLMs comprising 11 harmful subcategories and 1,100 prompts; including evaluations of other multimodal attacks, our total evaluation set contains over 2,200 prompts. We conducted evaluations across 6 frontier models from leading organizations: GPT-4o, GPT-4o Mini, Claude 3.5 Sonnet, Claude 3 Haiku, Gemini 1.5 Pro, and Gemini 1.5 Flash. Our results suggest that even the most capable models remain highly vulnerable to compositional adversarial attacks, specifically those composed of multiple images. Moreover, we observed that models with adequate safety mechanisms against harmful queries tended to produce overgeneralized refusals on similarly encoded benign inputs. Consequently, no model demonstrated robust resilience against compositional multi-image adversarial attacks without excessive defensiveness; in other words, none of the models were adequately aligned. Our results emphasize the need for improved cross-modal safety alignment that does not compromise multi-image understanding.
Submission Number: 34