ReToMe-VA: Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack

Published: 20 Jul 2024, Last Modified: 21 Jul 2024, MM2024 Poster, CC BY 4.0
Abstract: Recent diffusion-based unrestricted attacks generate imperceptible adversarial examples with higher transferability than previous unrestricted and restricted attacks. However, existing work on diffusion-based unrestricted attacks focuses mostly on images, and videos remain seldom explored. In this paper, we propose Recursive Token Merging for Video Diffusion-based Unrestricted Adversarial Attack (ReToMe-VA), the first framework to generate imperceptible adversarial video clips with higher transferability. Specifically, to achieve spatial imperceptibility, ReToMe-VA adopts a Timestep-wise Adversarial Latent Optimization (TALO) strategy that optimizes perturbations in the diffusion model's latent space at each denoising step. TALO offers iterative and accurate updates to generate more powerful adversarial frames, and it can further reduce memory consumption in gradient computation. Moreover, to achieve temporal imperceptibility, ReToMe-VA introduces a Recursive Token Merging (ReToMe) mechanism that matches and merges tokens across video frames in the self-attention module, resulting in temporally consistent adversarial videos. ReToMe also brings inter-frame interactions into the attack process, inducing more diverse and robust gradients and thus better adversarial transferability. Extensive experiments demonstrate the efficacy of ReToMe-VA, which surpasses state-of-the-art attacks in adversarial transferability by more than 14.16% on average.
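The timestep-wise idea in the abstract (update the latent at each denoising step rather than backpropagating through the whole chain) can be illustrated with a toy sketch. This is not the authors' implementation: `toy_denoise_step`, `toy_loss_and_grad`, `timestep_wise_attack`, the linear "classifier", the signed-gradient update, and every parameter value are hypothetical stand-ins chosen only so the example runs with the standard library.

```python
import math

def toy_denoise_step(latent, t):
    # Hypothetical stand-in for one diffusion denoising step (t is unused here).
    return [0.9 * z for z in latent]

def toy_loss_and_grad(latent, weights, label):
    # Logistic loss of a toy linear classifier, with its analytic gradient
    # with respect to the latent. Assumed for illustration only.
    score = sum(w * z for w, z in zip(weights, latent))
    y = 1.0 if label == 1 else -1.0
    margin = y * score
    loss = math.log1p(math.exp(-margin))
    dscore = -y / (1.0 + math.exp(margin))
    return loss, [dscore * w for w in weights]

def sign(x):
    return (x > 0) - (x < 0)

def timestep_wise_attack(latent, weights, label, num_steps=5, step_size=0.1):
    latent = list(latent)
    losses = []
    for t in range(num_steps, 0, -1):
        # The gradient is taken only at the current step, so nothing from
        # earlier steps has to be kept in memory for backpropagation.
        loss, grad = toy_loss_and_grad(latent, weights, label)
        losses.append(loss)
        # Signed gradient ascent on the classification loss (an assumed update rule).
        latent = [z + step_size * sign(g) for z, g in zip(latent, grad)]
        latent = toy_denoise_step(latent, t)
    return latent, losses
```

Calling `timestep_wise_attack([1.0, -0.5], [1.0, -1.0], 1)` returns the perturbed latent and the loss recorded before each update; setting `step_size=0.0` gives an unperturbed baseline for comparison.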
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Experience] Multimedia Applications, [Generation] Generative Multimedia, [Generation] Social Aspects of Generative AI
Relevance To Conference: This work contributes to the field of multimedia/multimodal processing. Text-to-image diffusion models allow multimedia systems to generate content with unparalleled realism and diversity. However, inappropriate applications of diffusion models can pose security threats. To mitigate potential risks and further study unsafe applications of diffusion models, this paper investigates the use of text-to-image diffusion models to carry out video diffusion-based unrestricted attacks aimed at misleading DNNs deployed in safety-critical scenarios. Specifically, we propose ReToMe-VA, the first framework to generate imperceptible adversarial video clips with higher transferability. We identify the challenges of directly applying diffusion-based image adversarial attacks to video and propose a Timestep-wise Adversarial Latent Optimization (TALO) strategy. We also extend the token merging mechanism, previously used in image diffusion models, to the video attack process to achieve temporal imperceptibility and boost transferability. Through extensive experiments against video CNNs and ViTs, ReToMe-VA demonstrates superior performance in generating imperceptible video adversarial examples with enhanced transferability, and thus reveals the security risks of diffusion model applications. The paper calls on the entire community to focus on the compliant use of text-driven diffusion models.
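As a rough illustration of matching and merging tokens across frames, the sketch below pairs each token of one frame with its most similar token in another frame (by cosine similarity), averages the `r` most similar pairs, and later copies the merged value back. It shows a single cross-frame merge only; the recursive scheme and its placement in self-attention are not reproduced here. All function names (`cosine`, `merge_across_frames`, `unmerge`) and the averaging rule are assumptions made for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_across_frames(dst_tokens, src_tokens, r):
    # For each source-frame token, find its best match in the destination frame.
    matches = []
    for i, s in enumerate(src_tokens):
        j, sim = max(((j, cosine(s, d)) for j, d in enumerate(dst_tokens)),
                     key=lambda p: p[1])
        matches.append((sim, i, j))
    matches.sort(reverse=True)
    merged_src = {i: j for _, i, j in matches[:r]}  # src index -> dst index

    # Average each destination token with the source tokens merged into it.
    sums = [list(d) for d in dst_tokens]
    counts = [1] * len(dst_tokens)
    for i, j in merged_src.items():
        sums[j] = [a + b for a, b in zip(sums[j], src_tokens[i])]
        counts[j] += 1
    merged_dst = [[v / c for v in s] for s, c in zip(sums, counts)]
    kept_src = [(i, list(t)) for i, t in enumerate(src_tokens) if i not in merged_src]
    return merged_dst, kept_src, merged_src

def unmerge(merged_dst, kept_src, merged_src, num_src):
    # Restore the source frame: merged tokens take the shared (averaged) value.
    out = [None] * num_src
    for i, t in kept_src:
        out[i] = t
    for i, j in merged_src.items():
        out[i] = list(merged_dst[j])
    return out
```

With `r` merged pairs, the attention layer in this toy setting would see `len(dst_tokens) + len(src_tokens) - r` tokens, and the two frames share values at the merged positions, which is the intuition behind the temporal consistency described above.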
Supplementary Material: zip
Submission Number: 1968