Abstract: In this paper, we present a novel method for learning a Bayesian neural network that is robust against adversarial attacks. Previous work has shown that an adversarially trained Bayesian Neural Network (BNN) provides improved robustness against attacks. However, existing learning approaches for approximating the multi-modal Bayesian posterior suffer from mode collapse, which leads to sub-par robustness and underperformance of the adversarially trained BNN. Instead, we propose a novel adversarial training method for BNNs that approximates the multi-modal posterior in a way that prevents mode collapse and encourages diversity across the learned posterior distributions over models. Importantly, we conceptualize and formulate information gain (IG) in the adversarial Bayesian learning context and prove that training a BNN with IG bounds the difference between the conventional empirical risk and the risk obtained from adversarial training; our intuition is that the information gain from benign and adversarial examples should be the same for a robust BNN. Extensive experimental results demonstrate that our proposed algorithm achieves state-of-the-art performance under strong adversarial attacks.
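The abstract does not give the paper's exact definition of information gain, but one common instantiation for a BNN classifier is the BALD-style mutual information between predictions and model parameters: the entropy of the mean predictive distribution minus the mean entropy of the per-sample predictive distributions, estimated over Monte Carlo posterior samples. The sketch below, with invented names and random stand-in "posterior samples", illustrates how such an IG term could be computed for benign and adversarial inputs and compared, as the abstract's intuition suggests; it is an assumption-laden illustration, not the paper's method.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    # Shannon entropy of categorical distributions along `axis`.
    return -np.sum(p * np.log(p + eps), axis=axis)

def information_gain(probs):
    # probs: (S, N, C) predictive probabilities from S Monte Carlo
    # posterior samples, for N inputs and C classes.
    # BALD-style IG = H[ E_theta p(y|x,theta) ] - E_theta H[ p(y|x,theta) ]
    mean_pred = probs.mean(axis=0)                     # (N, C)
    return entropy(mean_pred) - entropy(probs).mean(axis=0)

# Toy stand-in for posterior predictive samples (hypothetical data):
rng = np.random.default_rng(0)
logits_benign = rng.normal(size=(10, 4, 3))           # S=10, N=4, C=3
logits_adv = logits_benign + 0.5 * rng.normal(size=(10, 4, 3))
softmax = lambda z: np.exp(z) / np.exp(z).sum(-1, keepdims=True)

ig_benign = information_gain(softmax(logits_benign))  # shape (4,)
ig_adv = information_gain(softmax(logits_adv))        # shape (4,)

# A regularizer in the spirit of the abstract's intuition: penalize the
# gap between IG on benign and adversarial examples.
ig_gap = np.abs(ig_benign - ig_adv).mean()
```

By concavity of entropy, the estimated IG is non-negative, so a vanishing gap means the model is equally (un)certain about benign and adversarially perturbed inputs.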