Keywords: Adversarial Perturbations, Adversarial Examples, Adversarial Attacks, Non-Robust Features
TL;DR: We first prove that neural networks can learn natural data structures from adversarial perturbations.
Abstract: It is not fully understood why adversarial examples can deceive neural networks and transfer between different networks. To elucidate this, several studies have hypothesized that adversarial perturbations contain data features that are imperceptible to humans but still recognizable by neural networks. Empirical evidence has shown that neural networks trained on mislabeled samples containing these perturbations can generalize to natural test data. However, a theoretical understanding of this counterintuitive phenomenon is limited. In this study, assuming orthogonal training samples, we first prove that one-hidden-layer neural networks can learn natural data structures from adversarial perturbations. Our results indicate that, under mild conditions, the decision boundary obtained from learning perturbations aligns with that obtained from natural data, except at specific points in the input space.
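As a rough illustration of the empirical phenomenon the abstract refers to (training on mislabeled, adversarially perturbed samples and then evaluating on natural test data), here is a minimal sketch. It is not the authors' protocol: the two-Gaussian synthetic dataset, the targeted FGSM attack, the one-hidden-layer ReLU width, and all hyperparameters are illustrative assumptions.

```python
# Sketch: train a fresh one-hidden-layer network on adversarial samples labeled
# with the *attack target* (i.e., mislabeled), then evaluate on natural test data.
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_data(n, d=64):
    # Two Gaussian classes separated along a random mean direction (assumed dataset).
    mu = torch.randn(d)
    y = torch.randint(0, 2, (n,))
    x = torch.randn(n, d) + (2 * y.float() - 1).unsqueeze(1) * mu
    return x, y

class OneHidden(nn.Module):
    # One-hidden-layer ReLU network, as in the theoretical setting of the paper.
    def __init__(self, d, width=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, width), nn.ReLU(), nn.Linear(width, 2))
    def forward(self, x):
        return self.net(x)

def train(model, x, y, epochs=200, lr=1e-2):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()
    return model

d = 64
x_train, y_train = make_data(2000, d)
x_test, y_test = make_data(1000, d)

# 1) Train a "victim" network on naturally labeled training data.
victim = train(OneHidden(d), x_train, y_train)

# 2) Craft targeted FGSM perturbations that push each sample toward the wrong class.
x_src = x_train.clone().requires_grad_(True)
y_target = 1 - y_train                      # attack target = flipped label
nn.CrossEntropyLoss()(victim(x_src), y_target).backward()
eps = 0.5                                   # illustrative perturbation budget
x_adv = (x_train - eps * x_src.grad.sign()).detach()

# 3) Train a fresh network only on the mislabeled adversarial samples.
learner = train(OneHidden(d), x_adv, y_target)

# 4) Measure generalization to natural test data with the original labels.
acc = (learner(x_test).argmax(1) == y_test).float().mean().item()
print(f"Natural test accuracy after training on mislabeled adversarial samples: {acc:.3f}")
```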
Submission Number: 43