Adversarial Attacks as Near-Zero Eigenvalues in the Empirical Kernel of Neural Networks

22 Sept 2023 (modified: 11 Feb 2024) | Submitted to ICLR 2024
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Keywords: adversarial attacks, neural networks, kernels
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Abstract: Adversarial examples (imperceptibly modified data inputs designed to mislead machine learning models) have raised concerns about the robustness of modern neural architectures in safety-critical applications. In this paper, we propose a unified mathematical framework for understanding adversarial examples in neural networks, corroborating Ian Goodfellow's original conjecture that such examples are exceedingly rare, despite lying in the proximity of nearly every test case. By exploiting results from kernel theory, we characterise adversarial examples as those producing near-zero Mercer's eigenvalues in the empirical kernel associated with a trained neural network. Consequently, generating adversarial attacks with any known technique can be conceptualised as driving the eigenvalue spectrum of the empirical kernel towards zero. We rigorously prove this characterisation for trained fully-connected neural networks under mild assumptions on the nonlinear activation function, thus providing a mathematical explanation for the apparent contradiction of neural networks excelling at generalisation while remaining vulnerable to adversarial attacks. In experiments conducted on the MNIST dataset, we verify that adversarial examples generated with the widely used DeepFool algorithm do, indeed, shift the distribution of Mercer's eigenvalues towards zero, in strong agreement with the predictions of our theoretical framework.
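To make the abstract's claim concrete, the sketch below illustrates one plausible way to probe it: build an empirical kernel from a fully-connected network, compute its eigenvalues on a clean batch and on an adversarially perturbed batch, and count near-zero eigenvalues. This is not the authors' code; the assumptions are ours: the empirical kernel is taken to be the Gram matrix of penultimate-layer features, a one-step FGSM perturbation stands in for DeepFool, random data stands in for MNIST, and `net` would be a trained model in practice.

```python
# Minimal sketch (assumptions labelled above): compare eigenvalue spectra of an
# empirical kernel on clean vs. adversarially perturbed inputs.
import torch
import torch.nn as nn

torch.manual_seed(0)

class MLP(nn.Module):
    """Small fully-connected network with an explicit feature map."""
    def __init__(self, d_in=784, d_hidden=256, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(),
                                      nn.Linear(d_hidden, d_hidden), nn.ReLU())
        self.head = nn.Linear(d_hidden, n_classes)

    def forward(self, x):
        return self.head(self.features(x))

def empirical_kernel(net, x):
    """Assumed kernel: Gram matrix of penultimate features, K[i,j] = <phi(x_i), phi(x_j)> / d."""
    with torch.no_grad():
        phi = net.features(x)
    return phi @ phi.T / phi.shape[1]

def fgsm(net, x, y, eps=0.1):
    """One-step L-infinity attack, used here only as a stand-in for DeepFool."""
    x = x.clone().requires_grad_(True)
    loss = nn.functional.cross_entropy(net(x), y)
    loss.backward()
    return (x + eps * x.grad.sign()).detach()

net = MLP()                       # in practice: a model trained on MNIST
x = torch.rand(128, 784)          # placeholder inputs in [0, 1], standing in for MNIST
y = torch.randint(0, 10, (128,))  # placeholder labels

x_adv = fgsm(net, x, y)

eig_clean = torch.linalg.eigvalsh(empirical_kernel(net, x))
eig_adv = torch.linalg.eigvalsh(empirical_kernel(net, x_adv))

# Count eigenvalues below a small threshold; the abstract's claim predicts this
# count grows (the spectrum shifts towards zero) for the adversarial batch.
tau = 1e-3
print("near-zero eigenvalues (clean):", int((eig_clean.abs() < tau).sum()))
print("near-zero eigenvalues (adv):  ", int((eig_adv.abs() < tau).sum()))
```

With a trained network and a genuine DeepFool attack, the comparison above is the kind of measurement the experimental section would report; with the random placeholders used here, the numbers are only illustrative of the procedure, not of the result.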
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
Supplementary Material: zip
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5540