Keywords: Adversarial training, Adversarial examples, Minimax, Robustness
Abstract: Adversarial training promises a defense against adversarial perturbations in terms of average accuracy. In this work, we identify that the focus on the average accuracy metric can create vulnerabilities to the "weakest" class. For instance, on CIFAR10, where the average accuracy is 47%, the worst class accuracy can be as low as 14%. The performance sacrifice of the weakest class can be detrimental for real-world systems, if indeed the threat model can adversarially choose the class to attack. To this end, we propose to explicitly minimize the worst class error, which results in a min-max-max optimization formulation. We provide high probability convergence guarantees of the worst class loss for our method, dubbed as class focused online learning (CFOL), which can be plugged into existing training setups with virtually no overhead in computation. We observe significant improvements on the worst class accuracy of 30% for CIFAR10. We also observe consistent behavior across CIFAR100 and STL10. Intriugingly, we find that minimizing the worst case can even sometimes improve the average.
One-sentence Summary: We observe that the worst class can have much lower accuracy than the average class in adversarial examples, which we mitigate by solving a minmax formulation with tools from the bandit literature.
17 Replies
Loading