Provably robust classification of adversarial examples with detectionDownload PDF

Sep 28, 2020 (edited Mar 18, 2021)ICLR 2021 PosterReaders: Everyone
  • Keywords: Adversarial robustness, robust deep learning
  • Abstract: Adversarial attacks against deep networks can be defended against either by building robust classifiers or, by creating classifiers that can \emph{detect} the presence of adversarial perturbations. Although it may intuitively seem easier to simply detect attacks rather than build a robust classifier, this has not bourne out in practice even empirically, as most detection methods have subsequently been broken by adaptive attacks, thus necessitating \emph{verifiable} performance for detection mechanisms. In this paper, we propose a new method for jointly training a provably robust classifier and detector. Specifically, we show that by introducing an additional "abstain/detection" into a classifier, we can modify existing certified defense mechanisms to allow the classifier to either robustly classify \emph{or} detect adversarial attacks. We extend the common interval bound propagation (IBP) method for certified robustness under $\ell_\infty$ perturbations to account for our new robust objective, and show that the method outperforms traditional IBP used in isolation, especially for large perturbation sizes. Specifically, tests on MNIST and CIFAR-10 datasets exhibit promising results, for example with provable robust error less than $63.63\%$ and $67.92\%$, for $55.6\%$ and $66.37\%$ natural error, for $\epsilon=8/255$ and $16/255$ on the CIFAR-10 dataset, respectively.
  • Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
  • One-sentence Summary: We propose a joint classifier/detector training scheme with provable performance guarantees against adversarial perturbations.
  • Supplementary Material: zip
17 Replies