Attention Misalignment Attacks: Targeting Cross-Modal Attention in Multimodal Large Language Models for Adversarial Examples
Keywords: Adversarial Attack, Multimodal Large Language Model, Attention Mechanism
TL;DR: A novel adversarial attack paradigm that targets cross-modal attention in multimodal large language models, improving attack performance on fine-grained attack targets.
Abstract: Multimodal large language models (MLLMs) have achieved impressive performance across a wide range of multimodal understanding tasks. However, their growing deployment raises concerns about robustness under adversarial conditions. Existing adversarial attacks on MLLMs predominantly focus on disrupting the global semantic alignment between image and text by optimizing over joint embeddings or globally aggregated image/text token representations. We observe that such methods often fail to generate effective adversarial examples for fine-grained tasks such as Visual Question Answering (VQA), especially when the questions require a detailed understanding of particular image regions and thus demand precise alignment between those regions and the textual answers. To address this, we propose the Attention Misalignment Attack (AMA), a novel plug-and-play attack method that is highly compatible with existing attack objectives: it can be integrated simply by combining its attention misalignment loss with other attack losses. AMA extracts attention maps from each decoding step of the MLLM and optimizes the divergence between the target and adversarial attention patterns, guided by semantic similarity. This forces the model to attend to irrelevant regions, effectively misguiding its answer generation even on fine-grained questions. To improve efficiency, we further introduce FastAMA, a lightweight variant that avoids autoregressive decoding and instead uses a single forward pass to extract self-attention from the input tokens. Experiments show that our method significantly enhances the performance of existing attack methods across multiple tasks, especially on the more challenging instances within VQA datasets.
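To make the core idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of what a semantically weighted attention misalignment loss could look like. It assumes per-decoding-step attention distributions over image tokens have already been extracted for both the clean (target) and adversarial runs; all names, shapes, and the choice of KL divergence are illustrative assumptions.

```python
# Illustrative sketch only: the paper's actual loss, attention extraction, and
# weighting scheme are not specified here; every name below is a stand-in.
import torch


def attention_misalignment_loss(adv_attn, tgt_attn, sem_weights=None, eps=1e-8):
    """Encourage adversarial attention maps to diverge from the target maps.

    adv_attn, tgt_attn: tensors of shape (steps, num_image_tokens), each row a
        normalized attention distribution over image tokens at one decoding step.
    sem_weights: optional per-step weights (e.g. semantic similarity between the
        generated token and the attack target), shape (steps,).
    """
    adv = adv_attn.clamp_min(eps)
    tgt = tgt_attn.clamp_min(eps)
    # Per-step KL divergence between target and adversarial attention.
    kl = (tgt * (tgt.log() - adv.log())).sum(dim=-1)
    if sem_weights is not None:
        kl = kl * sem_weights
    # Negate so that minimizing the total attack loss *increases* the divergence,
    # pushing attention away from the regions the clean model relies on.
    return -kl.mean()


# Plug-and-play combination with an existing attack objective (hypothetical):
# total_loss = base_attack_loss + lambda_ama * attention_misalignment_loss(
#     adv_attn, tgt_attn, sem_weights)
```

Under this reading, the FastAMA variant would simply source `adv_attn` and `tgt_attn` from the self-attention of a single forward pass over the input tokens rather than from autoregressive decoding steps.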
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 17985