MAA: Meticulous Adversarial Attack against Vision-Language Pre-trained Models

ICLR 2025 Conference Submission 980 Authors

16 Sept 2024 (modified: 28 Nov 2024) · ICLR 2025 Conference Submission · CC BY 4.0
Keywords: Adversarial Attack, Vision-Language Pre-trained Models, Robustness
TL;DR: Transferable adversarial attack for evaluating the robustness of vision-language pre-trained models
Abstract: Current adversarial attacks for evaluating the robustness of vision-language pre-trained (VLP) models on multi-modal tasks suffer from limited transferability: attacks crafted for a specific model often fail to generalize across different models, limiting their utility for assessing robustness more broadly. This is mainly attributed to over-reliance on model-specific features and regions, particularly in the image modality. In this paper, we propose an elegant yet highly effective method, termed Meticulous Adversarial Attack (MAA), that fully exploits model-independent characteristics and vulnerabilities of individual samples, achieving enhanced generalizability and reduced model dependence. MAA emphasizes fine-grained optimization of adversarial images through a novel resizing and sliding crop (RScrop) technique combined with a multi-granularity similarity disruption (MGSD) strategy. RScrop efficiently enriches the initial adversarial examples by generating more comprehensive, diverse, and detailed views of the images, establishing a solid foundation for capturing representative and intrinsic visual characteristics. Building on this, MGSD maximizes the embedding distance between adversarial examples and their original counterparts across different granularities and hierarchical levels of the VLP architecture, thereby amplifying the impact of the adversarial perturbations and strengthening the attack at every layer and component of the model. Extensive experiments across diverse VLP models, multiple benchmark datasets, and a variety of downstream tasks demonstrate that MAA significantly enhances the effectiveness and transferability of adversarial attacks. A broad set of performance studies is conducted to generate insights into the effectiveness of various model configurations, guiding future advancements in this domain. The source code is provided in the supplementary material.
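The abstract describes the pipeline only in prose, so the following is a minimal, self-contained PyTorch sketch of how RScrop and MGSD could fit together. Everything here is an assumption built from the abstract alone: `ToyEncoder` is a stand-in for a VLP image encoder with accessible intermediate layers, and the crop scale, stride, and PGD hyperparameters (`eps`, `alpha`, `steps`) are illustrative defaults, not the authors' settings.

```python
# Hypothetical sketch of the MAA attack loop inferred from the abstract;
# not the authors' implementation (that is in their supplementary zip).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyEncoder(nn.Module):
    """Stand-in for a VLP image encoder exposing per-layer embeddings."""
    def __init__(self):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU()),
            nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU()),
        ])

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x.flatten(1))  # one embedding per hierarchical level
        return feats


def rscrop(img, scale=1.25, stride=16):
    """Resizing and sliding crop (RScrop): upscale the image, then slide a
    window of the original size over it to produce diverse, detailed views."""
    _, _, h, w = img.shape
    big = F.interpolate(img, scale_factor=scale, mode="bilinear",
                        align_corners=False)
    _, _, H, W = big.shape
    views = []
    for top in range(0, H - h + 1, stride):
        for left in range(0, W - w + 1, stride):
            views.append(big[:, :, top:top + h, left:left + w])
    return torch.cat(views, dim=0)


def mgsd_loss(adv_feats, clean_feats):
    """Multi-granularity similarity disruption (MGSD): total cosine
    similarity between adversarial and clean embeddings at every level.
    Minimizing this maximizes the embedding distance layer by layer."""
    return sum(F.cosine_similarity(a, c).mean()
               for a, c in zip(adv_feats, clean_feats))


def maa_attack(model, image, eps=8 / 255, alpha=2 / 255, steps=10):
    """PGD-style loop: perturb the image so that all RScrop views drift
    away from their clean embeddings, within an L-infinity budget."""
    with torch.no_grad():
        clean_feats = [f.detach() for f in model(rscrop(image))]
    delta = torch.zeros_like(image).uniform_(-eps, eps)
    for _ in range(steps):
        delta.requires_grad_(True)
        adv_feats = model(rscrop((image + delta).clamp(0, 1)))
        loss = mgsd_loss(adv_feats, clean_feats)   # similarity to minimize
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta.detach() - alpha * grad.sign()).clamp(-eps, eps)
    return (image + delta).clamp(0, 1)


if __name__ == "__main__":
    model = ToyEncoder().eval()
    x = torch.rand(1, 3, 64, 64)            # stand-in image batch
    x_adv = maa_attack(model, x)
    print((x_adv - x).abs().max().item())   # perturbation stays within eps
```

Note the design choice implied by the abstract: the clean embeddings of every RScrop view are computed once and frozen, so each PGD step only back-propagates through the adversarial branch, and disrupting similarity at all hierarchical levels (rather than only the final embedding) is what discourages the attack from latching onto features of a single surrogate model.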
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 980