Defuse: Debugging Classifiers Through Distilling Unrestricted Adversarial Examples

28 Sept 2020 (modified: 05 May 2023) | ICLR 2021 Conference Blind Submission | Readers: Everyone
Keywords: debugging, interpretability, explainability
Abstract: With the growing proliferation of machine learning models, the imperative of diagnosing and correcting bugs in models has become increasingly clear. As a route to better discover and fix model bugs, we propose failure scenarios: regions on the data manifold that are incorrectly classified by a model. We propose an end-to-end debugging framework called Defuse to use these regions for fixing faulty classifier predictions. The Defuse framework works in three steps. First, Defuse identifies many unrestricted adversarial examples--naturally occurring instances that are misclassified--using a generative model. Next, the procedure distills the misclassified data into failure scenarios using clustering. Last, the method corrects model behavior on the distilled scenarios through an optimization-based approach. We illustrate the utility of our framework on a variety of image data sets. We find that Defuse identifies and resolves concerning predictions while maintaining model generalization.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
One-sentence Summary: We propose failure scenarios -- regions in the latent space of a generative model that are heavily misclassified -- and Defuse -- a framework that fixes the predictions in these scenarios.
Reviewed Version (pdf): https://openreview.net/references/pdf?id=PADuPm8L9j
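
Illustrative sketch (not the authors' code): the three steps named in the abstract (identify, distill, correct) can be mimicked on a toy 2-D problem. Everything below is an assumption made for illustration: the identity `decode` stands in for a generative model's decoder, the linear rule stands in for the classifier being debugged, perturbed samples are assumed to inherit the label of their seed point, k-means is swapped in for the unspecified clustering method, and a weighted logistic-loss fine-tune (weight `lam`) is swapped in for the unspecified optimization-based correction.

```python
import numpy as np

rng = np.random.default_rng(0)

def decode(z):
    # Hypothetical stand-in for a generative model's decoder (identity here).
    return z

def predict(w, b, x):
    # Hypothetical linear classifier: class 1 when the score is positive.
    return (x @ w + b > 0).astype(int)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def kmeans(points, k, iters=20):
    # Plain k-means, used only as a simple substitute clustering method.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, labels

def finetune(w, b, x_clean, y_clean, x_fail, y_fail, lam=0.5, lr=0.5, steps=300):
    # Gradient descent on: logistic loss(clean) + lam * logistic loss(failures).
    w, b = w.copy(), float(b)
    for _ in range(steps):
        g_w, g_b = np.zeros_like(w), 0.0
        for x, y, weight in ((x_clean, y_clean, 1.0), (x_fail, y_fail, lam)):
            err = sigmoid(x @ w + b) - y
            g_w += weight * (x.T @ err) / len(x)
            g_b += weight * err.mean()
        w -= lr * g_w
        b -= lr * g_b
    return w, b

def accuracy(w, b, x, y):
    return float((predict(w, b, x) == y).mean())

# Toy data: the true label is 1 when the first coordinate is positive.
z_seed = rng.normal(size=(200, 2))
y_seed = (z_seed[:, 0] > 0).astype(int)

# A deliberately faulty classifier whose boundary sits at x0 = 0.5.
w0, b0 = np.array([1.0, 0.0]), -0.5

# Step 1 (identify): perturb latent codes, decode, keep misclassified samples.
z_pert = np.repeat(z_seed, 10, axis=0) + 0.05 * rng.normal(size=(2000, 2))
y_pert = np.repeat(y_seed, 10)  # assumption: label is inherited from the seed
wrong = predict(w0, b0, decode(z_pert)) != y_pert
fail_z, fail_y = z_pert[wrong], y_pert[wrong]

# Step 2 (distill): group the misclassified latent codes into candidate regions.
centers, cluster_ids = kmeans(fail_z, k=2)

# Step 3 (correct): fine-tune on clean data plus the distilled failures.
w1, b1 = finetune(w0, b0, decode(z_seed), y_seed, decode(fail_z), fail_y)

fail_before = accuracy(w0, b0, decode(fail_z), fail_y)
fail_after = accuracy(w1, b1, decode(fail_z), fail_y)
clean_before = accuracy(w0, b0, decode(z_seed), y_seed)
clean_after = accuracy(w1, b1, decode(z_seed), y_seed)
print("failures found:", len(fail_z))
print("accuracy on failures (before, after):", fail_before, fail_after)
print("accuracy on clean data (before, after):", clean_before, clean_after)
```

The printed clean-data accuracy is a rough proxy for the abstract's claim about maintaining generalization; in this toy setting it only shows that the correction term need not degrade performance on the original data.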