Modality-Specific Interactive Attack for Vision-Language Pre-Training Models

Published: 01 Jan 2025 (Last Modified: 24 Jul 2025) · IEEE Trans. Inf. Forensics Secur. 2025 · CC BY-SA 4.0
Abstract: Recent advances have heightened interest in the adversarial transferability of Vision-Language Pre-training (VLP) models. However, most existing strategies are constrained by two persistent limitations: suboptimal utilization of cross-modal interactive information and inherent discrepancies across hierarchical textual representations. To address these challenges, we propose the Modality-Specific Interactive Attack (MSI-Attack), a novel approach that integrates semantic-level image perturbations with embedding-level text perturbations while maintaining minimal inter-modal constraints. For the image attack, we introduce Multi-modal Integrated Gradients (MIG) to guide perturbations toward the core semantics of images, enriched by the deep textual information associated with them. This technique enhances transferability by capturing features that are consistent across models, thereby effectively misleading the perception regions shared by similar models. Additionally, we employ a momentum iteration strategy in conjunction with MIG, combining current and historical gradients to accelerate perturbation updates. For the text attack, we streamline the perturbation process by operating exclusively at the embedding level, which reduces semantic gaps across hierarchical structures and significantly enhances the generalizability of adversarial text. Moreover, we examine how semantic perturbations with varying degrees of similarity affect overall attack effectiveness. Experimental results on image-text retrieval tasks with the multi-modal datasets Flickr30K and MSCOCO underscore the efficacy of MSI-Attack: our method achieves superior performance and sets a new state of the art without requiring additional mechanisms.
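To illustrate the momentum iteration idea mentioned in the abstract, the sketch below shows a generic momentum-iterative image perturbation (in the style of MI-FGSM) driven by a surrogate adversarial loss. The function name, the `loss_fn` interface, and the step-size and epsilon values are assumptions for illustration only; this is not the paper's exact MSI-Attack or MIG formulation.

```python
import torch

def momentum_iterative_attack(image, loss_fn, eps=8/255, alpha=2/255, steps=10, mu=1.0):
    """Sketch of a momentum-iterative perturbation update (MI-FGSM-style).

    `loss_fn` maps an image tensor to a scalar adversarial loss, e.g. the
    negative image-text similarity of a surrogate VLP model (assumed interface).
    """
    image = image.clone().detach()
    adv = image.clone()
    momentum = torch.zeros_like(image)
    for _ in range(steps):
        adv = adv.detach().requires_grad_(True)
        loss = loss_fn(adv)
        grad, = torch.autograd.grad(loss, adv)
        # Accumulate normalized gradients: current and historical directions
        # are combined to stabilize and accelerate the perturbation update.
        momentum = mu * momentum + grad / grad.abs().mean().clamp_min(1e-12)
        # Signed step, then projection into the epsilon ball and valid pixel range.
        adv = adv.detach() + alpha * momentum.sign()
        adv = image + (adv - image).clamp(-eps, eps)
        adv = adv.clamp(0.0, 1.0)
    return adv.detach()
```

In this sketch the momentum term plays the role described in the abstract: it amalgamates the current gradient with accumulated past gradients so that successive updates point in a more consistent direction across iterations.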