Keywords: Visual Language Models, Adversarial Attacks, Attacking Directions, Adversarial Defense, Detection of Adversarial Samples
TL;DR: To analyze and defend against adversarial attacks on visual language models (VLMs), we present a new dataset and a novel detection method that exploits VLMs' attacking directions and is effective, efficient, and transferable across models.
Abstract: Visual Language Models (VLMs) are vulnerable to adversarial attacks, especially those carried by adversarial images, a threat that is, however, under-explored in the literature.
To facilitate research on this critical safety problem, we first construct a new la**R**ge-scale **A**dversarial image dataset with **D**iverse h**A**rmful **R**esponses (RADAR), given that existing datasets are either small-scale or contain only limited types of harmful responses.
With the new RADAR dataset, we further develop a novel and effective i**N**-time **E**mbedding-based **A**dve**RS**arial **I**mage **DE**tection (NEARSIDE) method, which exploits a single vector distilled from the hidden states of VLMs, which we call *the attacking direction*, to distinguish adversarial images from benign ones in the input.
Extensive experiments with two victim VLMs, LLaVA and MiniGPT-4, well demonstrate the effectiveness, efficiency,
and cross-model transferability of our proposed method. Our code is included in the supplementary file and will be made publicly available.
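For intuition only, below is a minimal sketch of how an embedding-based detector built around a single attacking-direction vector could operate. This is not the authors' NEARSIDE implementation: distilling the direction as a normalized mean difference of hidden-state embeddings, and the names `adv_embeds`, `benign_embeds`, and `threshold`, are all illustrative assumptions.

```python
import torch

def attacking_direction(adv_embeds: torch.Tensor,
                        benign_embeds: torch.Tensor) -> torch.Tensor:
    """Distill one direction vector from hidden-state embeddings.

    Assumption: the direction is the normalized mean difference between
    embeddings of adversarial and benign inputs; the paper's actual
    distillation procedure may differ.
    adv_embeds / benign_embeds: (num_samples, hidden_dim) tensors.
    """
    direction = adv_embeds.mean(dim=0) - benign_embeds.mean(dim=0)
    return direction / direction.norm()

def is_adversarial(embed: torch.Tensor,
                   direction: torch.Tensor,
                   threshold: float) -> bool:
    """Flag an input whose hidden-state embedding projects onto the
    attacking direction beyond a threshold calibrated on held-out data."""
    return float(embed @ direction) > threshold
```

Detection at this granularity is cheap (one dot product per input), which is consistent with the efficiency claim, though the actual method should be taken from the paper itself.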
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 6593