Keywords: Visual Language Models, Adversarial Attacks, Attacking Directions, Adversarial Defense, Detection of Adversarial Samples
TL;DR: To analyze and defend against adversarial attacks on visual language models (VLMs), we present a new dataset and a novel detection method that exploits VLMs' attacking directions and is effective, efficient, and transferable across models.
Abstract: Visual Language Models (VLMs) are vulnerable to adversarial attacks, especially those carried by adversarial images, a threat that is, however, under-explored in the literature.
To facilitate research on this critical safety problem, we first construct a new la**R**ge-scale **A**dversarial image dataset with **D**iverse h**A**rmful **R**esponses (RADAR), given that existing datasets are either small-scale or contain only limited types of harmful responses.
With the new RADAR dataset, we further develop a novel and effective i**N**-time **E**mbedding-based **A**dve**RS**arial **I**mage **DE**tection (NEARSIDE) method, which exploits a single vector distilled from the hidden states of VLMs, which we call *the attacking direction*, to distinguish adversarial images from benign ones in the input.
Extensive experiments with two victim VLMs, LLaVA and MiniGPT-4, well demonstrate the effectiveness, efficiency,
and cross-model transferability of our proposed method. Our code is included in the supplementary file and will be made publicly available.
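For intuition only, below is a minimal sketch of how an embedding-based detector built around a single attacking-direction vector could operate. This is not the authors' NEARSIDE implementation: distilling the direction as a normalized mean difference of hidden-state embeddings, and the names `adv_embeds`, `benign_embeds`, and `threshold`, are all illustrative assumptions.

```python
import torch

def attacking_direction(adv_embeds: torch.Tensor,
                        benign_embeds: torch.Tensor) -> torch.Tensor:
    """Distill one direction vector from hidden-state embeddings.

    Assumption: the direction is the normalized mean difference between
    embeddings of adversarial and benign inputs; the paper's actual
    distillation procedure may differ.
    adv_embeds / benign_embeds: (num_samples, hidden_dim) tensors.
    """
    direction = adv_embeds.mean(dim=0) - benign_embeds.mean(dim=0)
    return direction / direction.norm()

def is_adversarial(embed: torch.Tensor,
                   direction: torch.Tensor,
                   threshold: float) -> bool:
    """Flag an input whose hidden-state embedding projects onto the
    attacking direction beyond a threshold calibrated on held-out data."""
    return float(embed @ direction) > threshold
```

Detection at this granularity is cheap (one dot product per input), which is consistent with the efficiency claim, though the actual method should be taken from the paper itself.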
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6593