Keywords: uncertainty estimation, adversarial training
Abstract: The sensitivity of deep learning models to small input perturbations raises security concerns and limits their use in applications where reliability is critical. While adversarial training methods aim to produce more robust models, these techniques, including the most widely used Projected Gradient Descent (PGD) method, often lower unperturbed (clean) test accuracy. In this work, we propose uncertainty-targeted attacks (UTA), in which perturbations are obtained by maximizing the model's estimated uncertainty. We demonstrate on MNIST, Fashion-MNIST and CIFAR-10 that this approach does not drastically deteriorate clean test accuracy relative to PGD while remaining robust to PGD attacks. In particular, uncertainty-based attacks allow for larger $L_\infty$-balls around the training data points, are less prone to overfitting the attack, and yield an improved generalization-robustness trade-off.
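For intuition, below is a minimal PGD-style sketch of an uncertainty-targeted attack, assuming an MC-dropout predictive-entropy estimate as the uncertainty proxy; the estimator choice, step size `alpha`, ball radius `eps`, and step count are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def predictive_entropy(model, x, n_samples=8):
    """MC-dropout predictive entropy as an (assumed) uncertainty proxy."""
    model.train()  # keep dropout active for Monte Carlo sampling
    probs = torch.stack([F.softmax(model(x), dim=1) for _ in range(n_samples)])
    mean_probs = probs.mean(dim=0)
    return -(mean_probs * torch.log(mean_probs + 1e-12)).sum(dim=1)

def uncertainty_targeted_attack(model, x, eps=0.1, alpha=0.02, steps=10):
    """PGD-style ascent on estimated uncertainty instead of the classification loss.
    Hyperparameters here are illustrative, not the authors' settings."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        u = predictive_entropy(model, x_adv).sum()
        grad, = torch.autograd.grad(u, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()       # ascend the estimated uncertainty
            x_adv = x + (x_adv - x).clamp(-eps, eps)  # project onto the L_inf ball
            x_adv = x_adv.clamp(0.0, 1.0)             # stay in the valid pixel range
    return x_adv.detach()
```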