A Curious Case of Remarkable Resilience to Gradient Attacks via Fully Convolutional and Differentiable Front End with a Skip Connection

TMLR Paper 2293 Authors

26 Feb 2024 (modified: 17 Sept 2024) · Under review for TMLR · CC BY 4.0
Abstract: We experimented with front-end enhanced neural models in which a differentiable and fully convolutional model with a skip connection was prepended to a frozen backbone classifier. By training such composite models using a small learning rate for one epoch (or less), we obtained models that retained the accuracy of the backbone classifier while being unusually resistant to gradient attacks, including the APGD and FAB-T attacks from the AutoAttack package. We provided evidence that this was due to gradient masking: although the gradient masking phenomenon is not new, the degree of masking was quite remarkable for fully differentiable models that had neither gradient-shattering components such as JPEG compression nor components expected to cause diminishing gradients. The training recipe for producing such models was also remarkably stable and reproducible: we applied it to three datasets (CIFAR10, CIFAR100, and ImageNet) and several types of models (including recently proposed vision Transformers) without a single failure case.

Although black-box attacks such as the SQUARE attack and the zero-order PGD can be partially effective against gradient masking, these attacks are easily defeated by combining gradient-masking models into simple randomized ensembles. We estimate that these ensembles achieve near-SOTA AutoAttack accuracy on CIFAR10, CIFAR100, and ImageNet (while retaining virtually all the clean accuracy of the original classifiers) despite having virtually zero accuracy under adaptive attacks. Quite interestingly, adversarial training of the backbone classifier can further increase the resistance of the front-end enhanced model to gradient attacks. On CIFAR10, the respective randomized ensemble achieved 90.8±2.5% (99% CI) accuracy under AutoAttack while having only 18.2±3.6% accuracy under the adaptive attack.

We do not aim to establish SOTA in adversarial robustness. Instead, our paper makes methodological contributions and further supports the thesis that adaptive attacks designed with complete knowledge of the model architecture are crucial for demonstrating model robustness, and that even so-called white-box gradient attacks can have limited applicability. Although gradient attacks can be complemented with black-box attacks such as the SQUARE attack or the zero-order PGD, black-box attacks can be weak against randomized ensembles, e.g., when ensemble models mask gradients. Code and instructions to reproduce key results are available at https://anonymous.4open.science/r/curious_case_of_gradient_masking-2D3E
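As a concrete illustration of the composite architecture described above, the PyTorch sketch below prepends a small fully convolutional front end with a skip connection to a frozen backbone classifier and trains only the front end for a single pass with a small learning rate. This is a minimal sketch under stated assumptions: the layer widths, the learning rate, and the `train_loader` name are illustrative choices, not the authors' exact recipe.

```python
import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    """Fully convolutional, differentiable front end with a skip connection."""
    def __init__(self, channels: int = 3, hidden: int = 32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The skip connection keeps the front end close to the identity map,
        # which helps preserve the backbone's clean accuracy.
        return x + self.body(x)

class FrontEndEnhanced(nn.Module):
    """Composite model: trainable front end followed by a frozen backbone."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.front_end = ConvFrontEnd()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():
            p.requires_grad = False  # the backbone classifier stays frozen

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.backbone(self.front_end(x))

def train_front_end(model: FrontEndEnhanced, train_loader, lr: float = 1e-4) -> None:
    """One pass over the data (one epoch or less) with a small learning rate."""
    opt = torch.optim.SGD(model.front_end.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for images, labels in train_loader:
        opt.zero_grad()
        loss_fn(model(images), labels).backward()
        opt.step()
```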
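The randomized ensemble defense mentioned in the abstract can likewise be sketched as sampling one gradient-masking member per forward pass. The member models are assumed to be pre-built front-end enhanced classifiers; this is an illustration of the idea, not the paper's implementation.

```python
import random
import torch
import torch.nn as nn

class RandomizedEnsemble(nn.Module):
    """Answer each query with a member picked uniformly at random."""
    def __init__(self, members):
        super().__init__()
        self.members = nn.ModuleList(members)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        member = random.choice(self.members)  # fresh draw on every query
        return member(x)
```

Because the responding model changes between queries, query-based black-box attacks such as the SQUARE attack or the zero-order PGD lose the stable feedback signal they rely on, which is consistent with the behavior the abstract reports.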
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Pin-Yu_Chen1
Submission Number: 2293