Keywords: OOD detection, Adversarial detection, Extreme Value Theory
Abstract: This paper aims to transform a trained classifier into an abstaining classifier, such
that the latter is provably protected from out-of-distribution and adversarial samples. The proposed Sample-efficient Probabilistic Detection using Extreme Value
Theory (SPADE) approach relies on a Generalized Extreme Value (GEV) model
of the training distribution in the latent space of the classifier. Under mild assumptions, this GEV model allows for formally characterizing out-of-distribution
and adversarial samples and rejecting them. Empirical validation of the approach
is conducted on various neural architectures (ResNet, VGG, and Vision Transformer) and considers medium- and large-sized datasets (CIFAR-10, CIFAR-100,
and ImageNet). The results show the stability and frugality of the GEV model and
demonstrate SPADE's efficiency compared to state-of-the-art methods.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 11146