Provably robust classification of adversarial examples with detectionDownload PDF

Published: 12 Jan 2021, Last Modified: 05 May 2023ICLR 2021 PosterReaders: Everyone
Keywords: Adversarial robustness, robust deep learning
Abstract: Adversarial attacks against deep networks can be defended against either by building robust classifiers or, by creating classifiers that can \emph{detect} the presence of adversarial perturbations. Although it may intuitively seem easier to simply detect attacks rather than build a robust classifier, this has not bourne out in practice even empirically, as most detection methods have subsequently been broken by adaptive attacks, thus necessitating \emph{verifiable} performance for detection mechanisms. In this paper, we propose a new method for jointly training a provably robust classifier and detector. Specifically, we show that by introducing an additional "abstain/detection" into a classifier, we can modify existing certified defense mechanisms to allow the classifier to either robustly classify \emph{or} detect adversarial attacks. We extend the common interval bound propagation (IBP) method for certified robustness under $\ell_\infty$ perturbations to account for our new robust objective, and show that the method outperforms traditional IBP used in isolation, especially for large perturbation sizes. Specifically, tests on MNIST and CIFAR-10 datasets exhibit promising results, for example with provable robust error less than $63.63\%$ and $67.92\%$, for $55.6\%$ and $66.37\%$ natural error, for $\epsilon=8/255$ and $16/255$ on the CIFAR-10 dataset, respectively.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
One-sentence Summary: We propose a joint classifier/detector training scheme with provable performance guarantees against adversarial perturbations.
Supplementary Material: zip
Code: [![github](/images/github_icon.svg) boschresearch/robust_classification_with_detection](
17 Replies