On the exploitative behavior of adversarial training against adversarial attacks

Published: 28 Jan 2022, Last Modified: 13 Feb 2023 · ICLR 2022 Submission
Abstract: Adversarial attacks are intentionally designed perturbations added to the inputs of deep neural network classifiers in order to fool them. Adversarial training has been shown to be an effective approach to improving the robustness of classifiers against such attacks, especially in the white-box setting. In this work, we demonstrate that some geometric consequences of adversarial training on the decision boundary of deep networks give an edge to certain types of black-box attacks. In particular, we introduce a highly parallelizable black-box attack that exploits the low mean curvature of the decision boundary and targets classifiers equipped with an $\ell_2$ norm similarity detector. We use this black-box attack to demonstrate that adversarially trained networks might be easier to fool in certain scenarios. Moreover, we define a metric called robustness gain to show that while adversarial training is an effective method to improve robustness in the white-box attack setting, it may not provide a similarly large robustness gain against the more realistic decision-based black-box attacks.
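
The abstract does not spell out the attack itself, but the ingredients it names (decision-based, hard-label access; an $\ell_2$ norm similarity detector on queries; a low-mean-curvature decision boundary) can be illustrated with a small toy sketch. The code below is a hypothetical illustration under those assumptions, not the paper's attack: the linear toy classifier, the functions `query_label`, `l2_detector`, and `flat_boundary_attack`, and all parameter values are made up for the example. It probes the hard label at widely spaced points (so the detector is never triggered and the queries could be issued in parallel), fits a hyperplane to the probe labels because a flat boundary is well approximated by one, and then steps across the estimated boundary.

```python
import numpy as np

# Toy setup (illustrative assumption, not the paper's model): a binary
# "black-box" classifier whose decision boundary is an exact hyperplane,
# i.e. the extreme case of a low-mean-curvature boundary.
rng = np.random.default_rng(0)
dim = 10
w = rng.normal(size=dim)
w /= np.linalg.norm(w)           # hidden boundary normal
b = 0.3                          # hidden boundary offset

def query_label(x):
    """Decision-based (hard-label) access: only the predicted class is returned."""
    return int(np.dot(w, x) + b > 0)

def l2_detector(x, history, threshold=0.5):
    """Flags a query that lies within `threshold` (l2) of any earlier query."""
    return any(np.linalg.norm(x - h) < threshold for h in history)

def flat_boundary_attack(x0, n_probes=40, probe_radius=4.0, threshold=0.5):
    """Hypothetical sketch of a detector-aware, parallelizable hard-label attack.

    Probes are spaced much farther apart than the detector threshold, so none
    of them is flagged and all of them could be issued in parallel. Because the
    boundary is (locally) flat, a linear fit to the probe labels recovers an
    approximate boundary normal from few queries.
    """
    y0 = query_label(x0)
    history, probes, labels = [x0], [], []
    for _ in range(n_probes):
        d = rng.normal(size=dim)
        d /= np.linalg.norm(d)
        x = x0 + probe_radius * d
        if l2_detector(x, history, threshold):
            continue                      # skip queries the detector would flag
        history.append(x)
        probes.append(x - x0)
        labels.append(query_label(x))
    # Least-squares fit of a hyperplane to the +/-1 probe labels.
    P = np.array(probes)
    z = 2.0 * np.array(labels) - 1.0
    n_hat, *_ = np.linalg.lstsq(P, z, rcond=None)
    n_hat /= np.linalg.norm(n_hat)
    # Walk along the estimated normal until the hard label flips.
    direction = -n_hat if y0 == 1 else n_hat
    x_adv = x0.copy()
    while query_label(x_adv) == y0 and np.linalg.norm(x_adv - x0) < 50.0:
        x_adv = x_adv + 0.1 * direction
    return x_adv

x0 = rng.normal(size=dim)
x_adv = flat_boundary_attack(x0)
print("label flipped:", query_label(x_adv) != query_label(x0),
      "l2 distance:", round(float(np.linalg.norm(x_adv - x0)), 2))
```

On this toy model the probe labels alone recover the boundary direction; the abstract's claim is that adversarial training pushes real networks' decision boundaries toward such a low-curvature regime, which is what gives query-efficient, detector-evading attacks of this kind an edge.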