Revisiting Adversarial Robustness of Classifiers With a Reject Option

Published: 02 Dec 2021, Last Modified: 05 May 2023 · AAAI-22 AdvML Workshop Oral
Keywords: adversarial robustness, robust classification with rejection, classification with abstain, adversarial example detection
TL;DR: We propose a novel metric and a training method for learning a robust classifier with reject option in the presence of adversarial inputs.
Abstract: Adversarial training of deep neural networks (DNNs) is an important defense mechanism that makes a DNN robust to input perturbations that could otherwise cause prediction errors. Recently, there has been growing interest in learning a classifier with a reject (abstain) option, which can be more robust to adversarial perturbations by declining to return a prediction on inputs where the classifier may be incorrect. A challenge in the robust learning of a classifier with a reject option is that existing works have no mechanism to ensure that (very) small perturbations of the input are \textit{not} rejected when they could in fact be accepted and correctly classified. We first propose a novel metric -- \textit{robust error with rejection} -- that extends the standard definition of robust error to account for the rejection of small perturbations. The proposed metric has natural connections to the standard robust error (without rejection), as well as to the robust error with rejection proposed in a recent work. Motivated by this metric, we propose novel loss functions and a robust training method -- \textit{stratified adversarial training with rejection} (SATR) -- for a classifier with a reject option, whose goal is to accept and correctly classify small input perturbations while allowing the rejection of larger input perturbations that cannot be correctly classified. Experiments on well-known image classification DNNs using strong adaptive attack methods validate that SATR significantly improves the robustness of a classifier with rejection compared to standard adversarial training (with confidence-based rejection) as well as a recently proposed baseline.
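To make the metric concrete, here is a minimal sketch of the idea behind robust error with rejection as described in the abstract: an input counts as an error if any small perturbation is rejected or misclassified, or if any larger perturbation is misclassified without being rejected. The function name, the use of finite perturbation sets (the actual metric is defined over norm balls), and the toy predictor below are all illustrative assumptions, not the paper's implementation.

```python
def robust_error_with_rejection(predict, x, y,
                                small_perturbs, large_perturbs,
                                reject_label=-1):
    """Illustrative per-example 0/1 error under the abstract's criterion.

    - Small perturbations must be ACCEPTED and correctly classified:
      a rejection or a wrong label on any of them counts as an error.
    - Large perturbations may be rejected OR correctly classified:
      only an accepted-but-wrong prediction counts as an error.
    `predict` is assumed to return a class label or `reject_label`.
    """
    for d in small_perturbs:
        p = predict(x + d)
        if p == reject_label or p != y:
            return 1.0  # small perturbation rejected or misclassified
    for d in large_perturbs:
        p = predict(x + d)
        if p != reject_label and p != y:
            return 1.0  # large perturbation accepted but misclassified
    return 0.0


# Hypothetical 1-D classifier with a confidence-style reject region.
def toy_predict(z):
    if abs(z) > 2:
        return -1          # reject far-away inputs
    return 1 if z > 0 else 0


# Small perturbations stay accepted and correct; large ones get rejected.
print(robust_error_with_rejection(toy_predict, 1.0, 1,
                                  [-0.5, 0.5], [-3.5, 2.0]))  # 0.0
# A "small" perturbation that flips the predicted class counts as an error.
print(robust_error_with_rejection(toy_predict, 1.0, 1, [-1.5], []))  # 1.0
```

In practice the two perturbation sets would be replaced by worst-case searches (e.g. adaptive attacks) over a small and a large perturbation radius, which is what distinguishes this metric from the standard robust error.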