Keywords: neural network robustness, adversarial examples
Abstract: A key challenge in analyzing neural networks' robustness is identifying input features for which networks are robust to perturbations. Existing work focuses on direct perturbations to the inputs, thereby studying network robustness only at the level of the lowest-level features. In this work, we take a new approach and study the robustness of networks to the inputs' semantic features. We present a black-box approach to determine features for which a network is robust or weak. We leverage these features to obtain provably robust neighborhoods, defined over the robust features, and adversarial examples, obtained by perturbing the weak features. We evaluate our approach with PCA features. We show that (1) our provably robust neighborhoods are larger: on average by 1.8x and up to 4.5x, compared to standard neighborhoods, and (2) our adversarial examples are generated using at least 8.7x fewer queries and have at least 2.8x lower L2 distortion compared to the state-of-the-art. We further show that our attack is effective even against ensemble adversarial training.
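To make the idea of perturbing semantic features concrete, here is a minimal, hypothetical sketch (not the authors' code) of shifting an input along PCA components treated as semantic features; which components are "weak" is an assumption here, since the paper determines this via black-box queries to the network:

```python
# Hypothetical sketch: perturb an input along selected PCA components.
# The choice of which components are "weak" is assumed for illustration.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Toy stand-in for a dataset of flattened images (e.g., 28x28 = 784 pixels).
X = rng.normal(size=(1000, 784))
pca = PCA(n_components=50).fit(X)

def perturb_along_features(x, feature_idx, magnitude):
    """Shift input x along a chosen PCA component and map back to input space."""
    z = pca.transform(x.reshape(1, -1))      # project to PCA feature space
    z[0, feature_idx] += magnitude            # perturb the selected (e.g., weak) feature
    return pca.inverse_transform(z).ravel()   # reconstruct the perturbed input

# Example: perturb component 3 of the first sample by 0.5.
x_candidate = perturb_along_features(X[0], feature_idx=3, magnitude=0.5)
```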
One-sentence Summary: A framework to detect robust and weak features of DNNs and to use them to define robustness neighborhoods and adversarial examples.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=oHU5RFA97P