Inequality phenomenon in $l_{\infty}$-adversarial training, and its unrealized threatsDownload PDF

Published: 01 Feb 2023, Last Modified: 19 Feb 2023ICLR 2023 notable top 25%Readers: Everyone
Keywords: Adversarial training, Adversarial robustness, Adversarial feature represenation
Abstract: The appearance of adversarial examples raises attention from both academia and industry. Along with the attack-defense arms race, adversarial training is the most effective against adversarial examples. However, we find inequality phenomena occur during the $l_{\infty}$-adversarial training, that few features dominate the prediction made by the adversarially trained model. We systematically evaluate such inequality phenomena by extensive experiments and find such phenomena become more obvious when performing adversarial training with increasing adversarial strength (evaluated by $\epsilon$). We hypothesize such inequality phenomena make $l_{\infty}$-adversarially trained model less reliable than the standard trained model when few ``important features" are influenced. To validate our hypothesis, we proposed two simple attacks that either perturb or replace important features with noise or occlusion. Experiments show that $l_{\infty}$-adversarially trained model can be easily attacked when the few important features are influenced. Our work shed light on the limitation of the practicality of $l_{\infty}$-adversarial training.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Social Aspects of Machine Learning (eg, AI safety, fairness, privacy, interpretability, human-AI interaction, ethics)
TL;DR: We find an intriguing phenomena of $l_{\infty}$ adversarial training, and this phenomena brings unrealized threats to adversarially trained model.
18 Replies