Abstract: Existing cross-domain transferable attacks mostly explore adversarial transferability across domains of the same modality, while transferability across modalities, e.g., from image domains to video domains, has received less attention. This paper investigates cross-modal transferable attacks from image domains to video domains with a generator-oriented approach, i.e., crafting adversarial perturbations for each frame of a video clip with a perturbation generator trained on the ImageNet domain to attack target video models. To this end, we propose an effective Generative Cross-Modal Attacks (GCMA) framework to enhance adversarial transferability from image domains to video domains. To narrow the domain gap between image and video data, we first propose a random motion module that warps images with synthetic random optical flows. We then integrate the random motion module into a feature disruption loss to incorporate additional temporal cues during training. Specifically, the feature disruption loss minimizes the cosine similarity between intermediate features of warped benign and warped adversarial images. Furthermore, motivated by the positive correlation between transferability and the temporal consistency of adversarial video clips, we also introduce a temporal consistency loss that maximizes the cosine similarity between intermediate features of warped adversarial images and adversarial counterparts of warped benign images. Finally, GCMA trains the perturbation generator by jointly optimizing the feature disruption loss and the temporal consistency loss. Extensive experiments demonstrate the effectiveness of our proposed method, which achieves state-of-the-art performance on Kinetics-400 and UCF-101. Our code is available at https://github.com/kay-ck/GCMA.
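The two training objectives described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the function names, the flattened-vector features, and the weighting factor `lam` are assumptions; in practice the features would come from an intermediate layer of a surrogate image model.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two flattened feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def feature_disruption_loss(feat_warped_benign, feat_warped_adv):
    # Minimizing this similarity pushes the adversarial features
    # away from the benign features of the same warped frame.
    return cosine_similarity(feat_warped_benign, feat_warped_adv)

def temporal_consistency_loss(feat_warped_adv, feat_adv_of_warped):
    # Maximizing the similarity (i.e., minimizing its negative) encourages
    # perturbations that stay consistent under the synthetic motion.
    return -cosine_similarity(feat_warped_adv, feat_adv_of_warped)

def gcma_loss(f_benign_w, f_adv_w, f_warp_adv, f_adv_warp, lam=1.0):
    # `lam` is a hypothetical trade-off hyperparameter between the two terms.
    return (feature_disruption_loss(f_benign_w, f_adv_w)
            + lam * temporal_consistency_loss(f_warp_adv, f_adv_warp))
```

The perturbation generator would be updated by backpropagating this combined loss through the surrogate feature extractor, with the random motion module supplying the warped inputs at each iteration.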