Keywords: interpretation, robustness
Abstract: Model attribution is a critical component of deep neural networks (DNNs) because it provides interpretability for complex models. Recent works have drawn attention to the security of attributions, as they are vulnerable to attribution attacks that generate visually similar images with dramatically different attributions. Prior studies empirically improve the robustness of DNNs against such attacks; however, because these defenses lack certification, the actual robustness of the model at a given test point remains unknown. In this work, we define \emph{certified attribution robustness} for the first time, which upper bounds the dissimilarity of attributions after a sample is perturbed by any noise within a certain region while the classification result remains unchanged. Based on this definition, we propose approaches to certify attributions using Euclidean distance and cosine similarity under both $\ell_2$- and $\ell_\infty$-norm perturbation constraints. The bounds developed in our theoretical study are validated on three datasets (MNIST, Fashion-MNIST, and CIFAR-10) and two types of attacks (the PGD attack and the IFIA attribution attack). The experimental results show that the bounds certify the model effectively.
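As a reading aid (a sketch, not the paper's exact definition), the notion described in the abstract can be written as follows, assuming an attribution method $A$, a classifier $f$, an input $x$, a dissimilarity measure $D$ (e.g., Euclidean distance or one minus cosine similarity), a perturbation budget $\epsilon$, and a certification level $\tau$:
\[
  \sup_{\substack{\|\delta\|_p \le \epsilon \\ \arg\max f(x+\delta) \,=\, \arg\max f(x)}}
  D\bigl(A(x),\, A(x+\delta)\bigr) \;\le\; \tau, \qquad p \in \{2, \infty\}.
\]
That is, the attribution at $x$ is certified if no class-preserving perturbation within the $\ell_p$ ball of radius $\epsilon$ can change the attribution by more than $\tau$ under the chosen dissimilarity.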
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Submission Guidelines: Yes
Please Choose The Closest Area That Your Submission Falls Into: Social Aspects of Machine Learning (eg, AI safety, fairness, privacy, interpretability, human-AI interaction, ethics)