Accuracy of white box and black box adversarial attacks on a sign activation 01 loss neural network ensemble
Keywords: adversarial attack, 01 loss neural network, sign activation, robust machine learning
Abstract: In this work we ask: is an ensemble of single hidden layer sign activation 01 loss networks more robust to white-box and black-box adversarial attacks than ensembles of its differentiable counterparts, namely cross-entropy loss with relu activations and the approximately differentiable cross-entropy loss with sign activations? We consider a simple experimental setting: attacking models trained for binary classification on all pairwise CIFAR10 class combinations, a total of 45 datasets. We study ensembles of {\bf bcebp}: binary cross-entropy loss with relu activations trained with back-propagation, {\bf bceban}: binary cross-entropy loss with sign activations trained with back-propagation using the straight-through estimator gradient, {\bf 01scd}: 01 loss with sign activations trained with gradient-free stochastic coordinate descent, and {\bf bcescd}: binary cross-entropy loss with relu activations trained with gradient-free stochastic coordinate descent (to isolate the effect of 01 loss from that of gradient-free training). We train each model in an ensemble with a different random number generator seed. All four models have similar mean test accuracies in the mid to high 80s on the pairwise CIFAR10 datasets, but under powerful white-box PGD attacks each drops to near 0\% accuracy except the 01 loss network ensemble, which retains 31\%. Even models trained with gradient-free stochastic coordinate descent (bcescd) can be successfully attacked, suggesting that the defense lies in the 01 loss itself rather than in gradient-free training. In black-box transfer attacks we find that adversaries produced from the bcebp model transfer fully to bceban but much less to 01scd; we see the same transferability pattern from bceban to bcebp and 01scd. We also find that adversaries from 01scd barely transfer to bcebp and bceban. While our setting is far simpler than multi-class convolutional networks, our results suggest that 01 loss models are naturally hard to attack even without adversarial training. All models, data, and code to reproduce our results are available at \url{https://github.com/xyzacademic/mlp01example}.
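To make the setup concrete, below is a minimal sketch (not the authors' code, which is in the linked repository) of the two ingredients the abstract names: a single hidden layer sign-activation network trained with the straight-through estimator gradient (roughly the bceban setting) and an $L_\infty$ PGD white-box attack against a binary BCE-logit model. Class names, layer sizes, and attack hyperparameters here are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SignSTE(torch.autograd.Function):
    """Sign activation; the backward pass uses the straight-through estimator."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        # Pass the gradient through where |x| <= 1, block it elsewhere.
        return grad_output * (x.abs() <= 1).float()


class SignNet(nn.Module):
    """Single hidden layer network with a sign activation (sizes are assumptions)."""

    def __init__(self, in_dim=3072, hidden=20):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, 1)

    def forward(self, x):
        return self.fc2(SignSTE.apply(self.fc1(x)))  # logit for BCE loss


def pgd_linf(model, x, y, eps=8 / 255, alpha=2 / 255, steps=20):
    """Untargeted L-infinity PGD attack against a binary (BCE-logit) model."""
    loss_fn = nn.BCEWithLogitsLoss()
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv).squeeze(1), y.float())
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()       # ascend the loss
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project into the eps-ball
            x_adv = x_adv.clamp(0, 1)                 # keep valid pixel range
        x_adv = x_adv.detach()
    return x_adv
```

A transfer (black-box) attack in this sketch would simply generate `x_adv` against one trained model and evaluate a different model on it; the true 01 loss model (01scd) provides no useful gradients and is trained gradient-free, which is why it is only attacked in this transfer fashion.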
TL;DR: We show that a sign activation 01 loss neural network ensemble is harder to adversarially attack than its relu activation and sign activation cross-entropy loss counterparts.