Fit the Distribution: Cross-Image/Prompt Adversarial Attacks on Multimodal Large Language Models

Published: 18 Sept 2025, Last Modified: 29 Oct 2025 · NeurIPS 2025 poster · CC BY 4.0
Keywords: Multimodal Large Language Models; Adversarial attack; Evaluation
Abstract: Although Multimodal Large Language Models (MLLMs) have demonstrated remarkable achievements in recent years, they remain vulnerable to adversarial examples that elicit harmful responses. Existing attacks typically optimize adversarial perturbations for a specific image-prompt pair or a fixed training dataset, which often leads to overfitting. Consequently, these perturbations fail to remain effective when transferred to unseen image-prompt pairs, and covering the diverse multimodal inputs of complex real-world scenarios incurs significant resource costs. To alleviate this issue, this paper proposes a novel adversarial attack on MLLMs based on distribution approximation theory, which models the underlying image-prompt input distribution and applies the same distribution-fitting adversarial perturbation to multimodal input pairs, achieving effective cross-image/prompt transfer attacks. Specifically, we exploit the Laplace approximation to model the Gaussian distribution of the image and prompt inputs to the MLLM, deriving estimates of the mean and covariance parameters. By sampling from this approximated distribution via a Monte Carlo mechanism, we efficiently optimize and fit a single input-agnostic perturbation over diverse image-prompt pairs, yielding strong universality and transferability. Extensive experiments verify the strong adversarial capabilities of our proposed attack against prevalent MLLMs across a spectrum of images and prompts.
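To make the core idea concrete, the following is a minimal PyTorch sketch of the Monte Carlo optimization loop described in the abstract: fit a Gaussian to the observed image inputs, then optimize one shared, norm-bounded perturbation over samples drawn from that distribution. It is not the authors' implementation; the names (surrogate_loss, fit_gaussian, fit_universal_perturbation), the diagonal covariance, and the stand-in loss are assumptions made purely for illustration, and the paper's Laplace-approximation derivation of the mean and covariance is not reproduced here.

```python
import torch

# Hypothetical surrogate: in the real attack this would be the MLLM's loss
# (e.g. likelihood of a harmful target response) evaluated on
# (image + delta, prompt). Here it is only a stand-in so the sketch runs.
def surrogate_loss(image_batch, delta):
    return -((image_batch + delta) ** 2).mean()

# Step 1 (assumed simplification): fit a Gaussian to the observed image inputs
# with a diagonal covariance; the paper instead derives mean/covariance via a
# Laplace approximation.
def fit_gaussian(images):  # images: (N, C, H, W)
    mu = images.mean(dim=0)
    sigma = images.std(dim=0) + 1e-6
    return mu, sigma

# Step 2: Monte Carlo optimization of a single input-agnostic perturbation
# delta over samples drawn from the fitted distribution, under an L_inf bound.
def fit_universal_perturbation(mu, sigma, eps=8 / 255, steps=200,
                               mc_samples=8, lr=1e-2):
    delta = torch.zeros_like(mu, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        # Draw virtual image inputs from the approximated Gaussian.
        samples = mu + sigma * torch.randn(mc_samples, *mu.shape)
        loss = surrogate_loss(samples, delta)
        opt.zero_grad()
        loss.backward()
        opt.step()
        with torch.no_grad():
            delta.clamp_(-eps, eps)  # keep the perturbation imperceptible
    return delta.detach()

if __name__ == "__main__":
    images = torch.rand(64, 3, 224, 224)  # placeholder "training" images
    mu, sigma = fit_gaussian(images)
    delta = fit_universal_perturbation(mu, sigma)
    print(delta.abs().max())
```

Because delta is shared across all sampled inputs rather than tied to one image-prompt pair, the resulting perturbation is, under these assumptions, input-agnostic by construction, which is what enables the cross-image/prompt transfer the abstract claims.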
Primary Area: Evaluation (e.g., methodology, meta studies, replicability and validity, human-in-the-loop)
Submission Number: 4817