Keywords: Adversarial Attacks, Multimodal Large Language Models (MLLMs), Vision-Language Models (VLMs), Transferability, Generator-based Attacks, Video Adversarial Attacks
TL;DR: We propose a unified generator-based framework that crafts transferable adversarial attacks for both images and videos by directly targeting the language generation process of Multimodal LLMs.
Abstract: While Multimodal Large Language Models (MLLMs) demonstrate remarkable cross-modal reasoning, their core vision-language grounding mechanisms present critical vulnerabilities, particularly in complex video scenarios.
We introduce **CAVALRY**, a unified framework for generating powerful adversarial attacks against both image and video MLLMs.
Our approach introduces two key innovations: **(i)** a paradigm shift from conventional classification-boundary attacks to directly disrupting the generative process, realized through a novel loss that maximizes divergence from the likelihood of the ground-truth response, severing the visual-linguistic link; and **(ii)** an efficient, progressive generator trained to produce spatiotemporally coherent perturbations for both dynamic videos and static images.
Comprehensive evaluations on seven state-of-the-art MLLMs, including GPT-4.1, Gemini 2.0, and QwenVL-2.5, validate CAVALRY's superior performance.
Our method outperforms the strongest baselines by an average of 22.8% on video understanding benchmarks and extends this advantage to static images, proving 34.4% more effective than prior work.
These results establish CAVALRY as a foundational framework for probing the adversarial robustness of the entire spectrum of modern MLLMs.
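The generative-disruption objective in **(i)** can be illustrated with a toy sketch: instead of pushing an input across a classification boundary, the attacker perturbs the visual input so the model assigns low likelihood to the ground-truth response. The NumPy snippet below is a minimal, hypothetical illustration of that idea, not the paper's actual loss or implementation; the model, logits, and token vocabulary here are assumptions.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def ground_truth_log_likelihood(logits, gt_tokens):
    """Mean per-step log-likelihood the model assigns to the ground-truth
    response tokens. A generative-disruption attack (hypothetical sketch of
    the CAVALRY-style objective) perturbs the visual input to DRIVE THIS
    QUANTITY DOWN, i.e. maximize divergence from the correct response."""
    probs = softmax(logits)                                  # (T, V): per-step token distributions
    ll = np.log(probs[np.arange(len(gt_tokens)), gt_tokens] + 1e-12)
    return ll.mean()

# Toy check: a perturbation that suppresses the ground-truth token logits
# lowers the likelihood the model assigns to the correct response.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))                            # 4 decoding steps, vocab of 10
gt = np.array([1, 3, 5, 7])                                  # hypothetical ground-truth token ids
clean_ll = ground_truth_log_likelihood(logits, gt)

attacked_logits = logits.copy()
attacked_logits[np.arange(4), gt] -= 2.0                     # stand-in for an adversarial perturbation
attacked_ll = ground_truth_log_likelihood(attacked_logits, gt)
assert attacked_ll < clean_ll                                # likelihood of the true response drops
```

In a real attack the perturbation would be applied to image or video pixels and optimized through the MLLM's vision encoder and language head; this sketch only isolates the sign of the objective.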
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 10588