Keywords: Adversarial attacks, vision language models, cross-prompt
TL;DR: Using gradient regularization to enhance cross-prompt adversarial attacks on vision language models.
Abstract: Recent large vision language models (VLMs) have gained significant attention for their superior performance in various visual understanding tasks using textual instructions, also known as prompts.
However, existing research shows that VLMs are vulnerable to adversarial examples, where imperceptible perturbations added to images can lead to malicious outputs, posing security risks during deployment.
Unlike single-modal models, VLMs process both images and text simultaneously, making the creation of visual adversarial examples dependent on specific prompts.
Consequently, the same adversarial example may become ineffective when different prompts are used, which is common as users often input diverse prompts.
Our experiments reveal severe non-stationarity when adversarial examples are optimized directly over multiple prompts: the resulting examples overfit to a single prompt and transfer poorly to others.
To address this issue, we propose the Gradient Regularized-based Cross-Prompt Attack (GrCPA), which leverages gradient regularization to generate more robust adversarial attacks, thereby improving the assessment of model robustness.
By exploiting the structural characteristics of the Transformer, GrCPA reduces the variance of back-propagated gradients in the Attention and MLP components, utilizing regularized gradients to produce more effective adversarial examples.
Extensive experiments on models such as Flamingo, BLIP-2, LLaVA, and InstructBLIP demonstrate the effectiveness of GrCPA in enhancing the transferability of adversarial attacks across different prompts.
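The core idea, reducing the variance of back-propagated gradients across prompts before using them to update the perturbation, can be illustrated with a minimal sketch. The functions below are hypothetical simplifications (the abstract does not give the exact regularization rule, and the paper applies it inside the Transformer's Attention and MLP blocks rather than to the final image gradient as shown here): per-prompt gradients whose components deviate strongly from the cross-prompt mean are clipped toward it, and the aggregated gradient drives a standard sign-based step.

```python
import numpy as np

def regularize_gradients(per_prompt_grads, beta=1.0):
    """Illustrative cross-prompt gradient regularization (not the paper's exact rule).

    per_prompt_grads: array of shape (P, ...) holding one image gradient per prompt.
    Components deviating from the per-element mean by more than `beta` standard
    deviations are clipped back toward the mean, damping prompt-specific outliers
    before the gradients are averaged.
    """
    g = np.asarray(per_prompt_grads, dtype=np.float64)
    mean = g.mean(axis=0, keepdims=True)
    std = g.std(axis=0, keepdims=True)
    clipped = np.clip(g, mean - beta * std, mean + beta * std)
    return clipped.mean(axis=0)

def pgd_step(image, grad, alpha=2 / 255, eps=8 / 255, x0=None):
    """One sign-gradient ascent step, projected onto an L_inf ball around x0."""
    x0 = image if x0 is None else x0
    adv = image + alpha * np.sign(grad)
    # Project back into the eps-ball and the valid pixel range [0, 1].
    return np.clip(np.clip(adv, x0 - eps, x0 + eps), 0.0, 1.0)
```

In a full attack loop, `per_prompt_grads` would come from back-propagating the adversarial loss through the VLM once per prompt in the training set; the regularized aggregate then replaces the naive average gradient in each PGD iteration.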
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6171