Adversarial Attention Deficit: Fooling Deformable Vision Transformers with Collaborative Adversarial Patches

Published: 2025, Last Modified: 17 Mar 2026, WACV 2025, CC BY-SA 4.0
Abstract: Deformable vision transformers replace dense attention, whose cost grows quadratically, with sparse attention structures, which makes transformers practical for large-scale vision applications such as multi-view vision systems. We show that existing adversarial attacks against conventional vision transformers do not transfer to deformable transformers, primarily because their sparse attention is data-dependent and dynamic. In this work, we present the first adversarial attacks against deformable vision transformers, which work by taking control of the attention-inferring module. We develop a novel collaborative attack in which a source patch manipulates attention to point to a target patch that contains the adversarial noise that fools the model. We observe that our attack alters less than 1% of the patched area in the input field yet completely disrupts object detection, resulting in 0% AP in single-view object detection on MS COCO and 0% MODA in multi-view object detection on Wildtrack.
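To make the "data-dependent, dynamic" nature of sparse attention concrete, the following is a minimal 1D sketch of deformable-style attention, not the paper's attack and not any library's implementation. All names (`deformable_attention_1d`, `w_off`, `w_att`, and the toy weights) are hypothetical and chosen for illustration. Each query predicts a handful of sampling offsets and weights from its own features, so changing the features under the query changes where attention looks.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def linear(weights, bias, x):
    # One output per weight row: dot(row, x) + bias.
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def interpolate(values, pos):
    # Linear interpolation on a 1D list, clamped to the valid range.
    pos = min(max(pos, 0.0), len(values) - 1.0)
    lo = int(math.floor(pos))
    hi = min(lo + 1, len(values) - 1)
    frac = pos - lo
    return (1.0 - frac) * values[lo] + frac * values[hi]

def deformable_attention_1d(query_feat, query_pos, values,
                            w_off, b_off, w_att, b_att):
    # Offsets and attention weights are both inferred from the query features,
    # which is what makes the sampling pattern input-dependent.
    offsets = linear(w_off, b_off, query_feat)
    weights = softmax(linear(w_att, b_att, query_feat))
    locations = [query_pos + o for o in offsets]
    output = sum(w * interpolate(values, loc)
                 for w, loc in zip(weights, locations))
    return output, locations, weights

# Toy setup (assumed values): 8 scalar "value" features, K = 2 sampling points.
values = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
w_off, b_off = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w_att, b_att = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]

# Same query position, two different query features.
clean_out, clean_locs, _ = deformable_attention_1d(
    [1.0, -1.0], 3, values, w_off, b_off, w_att, b_att)
pert_out, pert_locs, _ = deformable_attention_1d(
    [3.0, 2.0], 3, values, w_off, b_off, w_att, b_att)

print(clean_locs, clean_out)
print(pert_locs, pert_out)
```

In this toy, the query at position 3 samples around locations 4 and 2 with the first feature vector and around 6 and 5 with the second. This only illustrates the general property that the abstract relies on: whoever influences the features that feed the offset predictor also influences which locations are attended to.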