Severing the Link: A Unified Adversarial Attack on Image and Video MLLMs via Generative Disruption

18 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Adversarial Attacks, Multimodal Large Language Models (MLLMs), Vision-Language Models (VLMs), Transferability, Generator-based Attacks, Video Adversarial Attacks
TL;DR: We propose a unified generator-based framework that crafts transferable adversarial attacks for both images and videos by directly targeting the language generation process of Multimodal LLMs.
Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable cross-modal reasoning, their core vision-language grounding mechanisms present critical vulnerabilities, particularly in complex video scenarios. We introduce **CAVALRY**, a unified framework for generating powerful adversarial attacks against both image and video MLLMs. Our approach introduces two key innovations: **(i)** a paradigm shift from conventional classification-boundary attacks to directly disrupting the generative process, realized through a novel loss that maximizes the likelihood divergence of the ground-truth response and severs the visual-linguistic link; and **(ii)** an efficient, progressive generator trained to produce spatiotemporally coherent perturbations for both dynamic videos and static images. Comprehensive evaluations on seven state-of-the-art MLLMs, including GPT-4.1, Gemini 2.0, and QwenVL-2.5, validate CAVALRY's superior performance. Our method outperforms the strongest baselines by an average of 22.8\% on video understanding benchmarks and extends this advantage to static images, proving 34.4\% more effective than prior work. These results establish CAVALRY as a foundational framework for probing the adversarial robustness of the entire spectrum of modern MLLMs.
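The core idea in innovation **(i)** is to attack the generation process itself: instead of pushing an input across a classification boundary, the perturbation generator is trained to drive down the surrogate MLLM's likelihood of the ground-truth response. The following is an illustrative sketch of that objective, not the authors' released code; the function name `disruption_loss` and the per-token log-probability input are our own assumptions about how such a loss could be wired up.

```python
import numpy as np

def disruption_loss(gt_token_log_probs: np.ndarray) -> float:
    """Sketch of a generative-disruption objective.

    gt_token_log_probs[t] is assumed to be log p(y_t | y_<t, perturbed
    visual input) under a white-box surrogate MLLM, where y is the
    ground-truth response (e.g. the correct caption or answer).

    The attack *maximizes* the negative log-likelihood of y, which is the
    same as *minimizing* its mean log-likelihood. Returning the mean
    log-likelihood lets a standard minimizer (optimizing the perturbation
    generator's weights) push the model away from the correct response,
    severing the visual-linguistic link described in the abstract.
    """
    return float(np.mean(gt_token_log_probs))

# Illustration: a perturbation that drops per-token probabilities from
# 0.9 to 0.1 yields a lower (better, for the attacker) loss value.
clean = disruption_loss(np.log(np.array([0.9, 0.9, 0.9])))
attacked = disruption_loss(np.log(np.array([0.1, 0.1, 0.1])))
```

In a full pipeline, this scalar would be backpropagated through the frozen surrogate MLLM into the perturbation generator, which is the part trained in innovation **(ii)**.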
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10588