Abstract: Deep Neural Network (DNN) classifiers often assign high confidence to Out-of-Distribution (OOD) examples that lie outside the training distribution, i.e., the In-Distribution (ID), leading to classification errors. Detecting and rejecting diverse OOD examples is therefore crucial for the reliability of DNNs. More challenging still, even well-built detectors can be bypassed by adversarial attacks that perturb unseen OOD examples. Some existing works introduce adversarial training on auxiliary outliers to improve the robustness of OOD detection. In this work, however, we find that adversarial training on auxiliary outliers alone is insufficient to make detection robust against strong adaptive attacks. To address this weakness, we propose RobDet, a semi-supervised adversarial training approach. RobDet mines adversarially perturbed ID examples from the neighborhood of clean ID examples as auxiliary outliers and trains them, together with other clean and adversarially perturbed auxiliary outliers, under multiple "other" classes, enhancing the robustness of OOD detection without significantly sacrificing performance on clean OOD examples. Experiments show that RobDet offers a significant advantage in detecting malicious OOD examples crafted by strong adaptive attacks while maintaining strong performance in detecting clean OOD examples.
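The sketch below illustrates, under stated assumptions, the kind of training step the abstract describes: adversarial neighbors of clean ID examples and perturbed auxiliary outliers are assigned to extra "other" classes and trained jointly with the clean ID classification loss. This is not the authors' implementation; the PGD hyperparameters, the random assignment of "other" labels, and names such as `pgd_perturb` and `robust_ood_step` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pgd_perturb(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """Untargeted L-inf PGD: maximize cross-entropy against label y (assumed setup)."""
    x = x.detach()
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv.detach() + alpha * grad.sign()
        # Project back into the eps-ball around x, then into the valid pixel range.
        x_adv = (x + (x_adv - x).clamp(-eps, eps)).clamp(0, 1)
    return x_adv.detach()

def robust_ood_step(model, x_id, y_id, x_out, num_id_classes, num_other_classes):
    """One hypothetical training step; model is assumed to output
    num_id_classes + num_other_classes logits."""
    device = x_id.device
    # Adversarial neighbors of clean ID examples, treated as auxiliary outliers.
    x_id_adv = pgd_perturb(model, x_id, y_id)
    # Placeholder assumption: assign each auxiliary outlier a random "other" class.
    y_out = torch.randint(num_id_classes, num_id_classes + num_other_classes,
                          (x_out.size(0),), device=device)
    y_id_adv = torch.randint(num_id_classes, num_id_classes + num_other_classes,
                             (x_id_adv.size(0),), device=device)
    # Adversarially perturb the auxiliary outliers against their "other" labels,
    # emulating an adaptive attack that pushes outliers toward looking in-distribution.
    x_out_adv = pgd_perturb(model, x_out, y_out)
    logits = model(torch.cat([x_id, x_id_adv, x_out, x_out_adv]))
    targets = torch.cat([y_id, y_id_adv, y_out, y_out])
    return F.cross_entropy(logits, targets)
```

At test time, a detector in this style would flag an input as OOD when the summed probability mass on the "other" classes exceeds a threshold; the exact scoring rule used by RobDet is described in the paper itself.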