A Multimodal Adversarial Attack Method via Frequency Domain Enhancement and Fine-Grained Cross-Modal Guidance

Yaguan Qian, Qinqin Yu, Qiqi Bao, Shouling Ji, Wei Wang, Bin Wang, Zhaoquan Gu, Zhen Lei

Published: 2025, Last Modified: 12 Mar 2026. IEEE Trans. Dependable Secur. Comput. 2025. CC BY-SA 4.0
Abstract: Vision-language pretraining (VLP) models have demonstrated outstanding performance in image-text understanding tasks but remain highly susceptible to transferable adversarial attacks. While ensemble-based guided attacks improve adversarial transferability by increasing the diversity of image-text pairs, they rely primarily on spatial-domain data augmentation, which can lead to overfitting to image details and limit the generalization capability of attacks. To address this limitation, this study proposes a frequency-domain adjustment-based adversarial attack method that modifies specific frequency components of input images to reduce detail interference and enhance the stability of adversarial examples. Additionally, a fine-grained feature extraction technique is introduced to optimize image-text alignment, further improving the transferability of cross-modal attacks. Experimental results demonstrate that the proposed method achieves superior attack transferability and generalization performance across two major VLP architectures, fusion models and alignment models, as well as on multiple tasks on the Flickr30K and MSCOCO datasets.
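As a rough illustration of what modifying frequency components of an input image can look like, the following minimal NumPy sketch applies a low-pass mask in the 2-D Fourier domain. The abstract does not state which components the method adjusts or how, so the low-pass choice, the function name `low_pass_filter`, and the `keep_ratio` parameter are all assumptions made for illustration, not the authors' procedure.

```python
import numpy as np


def low_pass_filter(image: np.ndarray, keep_ratio: float = 0.25) -> np.ndarray:
    """Suppress high-frequency detail in an H x W x C float image.

    Hypothetical helper: `keep_ratio` is the fraction of the maximum
    per-axis frequency (0.5 cycles/pixel) that is retained.
    """
    h, w = image.shape[:2]

    # Normalized frequency coordinates, shifted so that zero frequency is centred.
    fy = np.fft.fftshift(np.fft.fftfreq(h))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(w))[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)

    # Binary mask: 1 inside the retained low-frequency disc, 0 outside.
    mask = (radius <= keep_ratio * 0.5).astype(np.float64)

    # Forward FFT per channel, centre the spectrum, apply the mask.
    spectrum = np.fft.fftshift(np.fft.fft2(image, axes=(0, 1)), axes=(0, 1))
    filtered = spectrum * mask[:, :, None]

    # Undo the shift and invert; the imaginary part is numerical noise.
    restored = np.fft.ifft2(np.fft.ifftshift(filtered, axes=(0, 1)), axes=(0, 1))
    return np.real(restored)
```

In this sketch a constant image passes through unchanged, because only its zero-frequency component is non-zero, while a pixel-level checkerboard is removed entirely. An attack pipeline would presumably apply some such transform to inputs before computing gradients, but that integration is not described in the abstract and is not shown here.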