Keywords: Adversarial Robustness, Statistical Guarantees, Deep Neural Networks, Bayesian Neural Networks
TL;DR: Given a neural network $f$, we investigate its global adversarial robustness properties, showing how these can be computed up to any a priori specified statistical error.
Abstract: We investigate global adversarial robustness guarantees for machine learning models. Specifically, given a trained model, we consider the problem of computing the probability that its prediction at a point sampled from the (unknown) input distribution is susceptible to adversarial attacks. Assuming continuity of the model, we prove measurability for a selection of local robustness properties used in the literature. We then show how concentration inequalities can be employed to compute global robustness with estimation error upper-bounded by $\epsilon$, for any $\epsilon > 0$ selected a priori. We utilise these methods to provide a statistically sound analysis of the robustness/accuracy trade-off for a variety of neural network architectures and training methods on MNIST, Fashion-MNIST and CIFAR. We empirically observe that robustness and accuracy tend to be negatively correlated for networks trained via stochastic gradient descent and with iterative pruning techniques, while a positive correlation between them is observed in Bayesian settings.
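To make the estimation procedure concrete, here is a minimal sketch (not the paper's implementation) of how a concentration inequality, here Hoeffding's, gives a global robustness estimate with error at most $\epsilon$ at confidence $1-\delta$. The names `sample_input` and `is_locally_robust` are hypothetical placeholders for the input distribution and for whichever local robustness property is being checked:

```python
import math

def required_samples(epsilon: float, delta: float) -> int:
    """Hoeffding bound: with this many i.i.d. samples, the empirical mean of a
    [0, 1]-valued variable deviates from the true mean by more than epsilon
    with probability at most delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))

def estimate_global_robustness(model, sample_input, is_locally_robust,
                               epsilon=0.01, delta=0.01):
    """Monte Carlo estimate of the probability that a point drawn from the
    input distribution satisfies the local robustness property, accurate to
    within `epsilon` with confidence 1 - `delta`."""
    n = required_samples(epsilon, delta)
    robust = 0
    for _ in range(n):
        x = sample_input()  # draw a point from the (unknown) input distribution
        robust += int(is_locally_robust(model, x))  # e.g. an attack or local verifier
    return robust / n
```

In practice, `is_locally_robust` could, for instance, run an adversarial attack at `x` and check that the prediction is unchanged within the perturbation budget; the measurability results described in the abstract are what make the estimated probability well-defined.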