Abstract: Three important criteria of existing convolutional neural networks (CNNs) are (1) test-set accuracy; (2) out-of-distribution accuracy; and (3) explainability. While these criteria have been studied independently, their relationship is unknown. For example, do CNNs with better out-of-distribution performance also have better explainability? Furthermore, most prior explainability studies only evaluate methods on 2-3 common vanilla ImageNet-trained CNNs, leaving it unknown how these methods generalize to CNNs of other architectures and training algorithms. Here, we perform the first large-scale evaluation of the relationship among the three criteria using nine feature-importance methods and 12 ImageNet-trained CNNs spanning five architectures and three training algorithms. We report several important insights and recommendations for ML practitioners. First, adversarially robust CNNs have higher explainability scores on gradient-based attribution methods (but not CAM-based or perturbation-based methods). Second, AdvProp models, despite being highly accurate, are not superior in explainability. Third, among the nine feature-attribution methods tested, GradCAM and RISE are consistently the best.
Fourth, Insertion and Deletion are biased towards vanilla and robust models, respectively, due to their strong correlation with the confidence-score distributions of a CNN. Fifth, we did not find a single CNN that is the best in all three criteria, suggesting that CNNs with better performance do not necessarily have better explainability. Sixth, ResNet-50 is, on average, the best architecture among those tested, indicating that architectures with higher test-set accuracy do not necessarily have better explainability scores.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: - We added a new figure (Figure 1) to capture the motivation for our research and our series of experiments.
- We color major textual revisions in blue to aid your review (the final text color will be black, following the TMLR style).
- We released anonymized code to aid the reproduction of the AdvProp training (in reply to a question by reviewer `uqq6`).
**Organization and motivation of our study**
In light of your feedback, we added a new Figure 1 to illustrate the story of our paper and provide a list of concrete questions (at the beginning of Sec. 3).
- We first answer the paper's central question of how adversarial training affects a network's feature-attribution maps (Sec. 3.1).
- Then, we address the question of whether feature-attribution methods perform similarly when averaged over network architectures and training regimes (Sec. 3.2). Interestingly, GradCAM and RISE are consistently the best-performing methods.
- We then compare network architectures based on the quality of their attribution maps and find ResNet-50 to be the all-around winner (Sec. 3.3).
- Next, we compare models based on three important properties (explainability via feature attribution, number of multiply-accumulate operations, and classification accuracy on real images) and discuss the trade-offs practitioners may need to make (Sec. 3.4).
- From the same study over 5 network architectures, 3 training algorithms, and 9 feature-attribution methods, we find that AdvProp is, on average, a better training paradigm than training on real or adversarial data alone (Sec. 3.5). That is, AdvProp models tend to outperform vanilla models in both classification accuracy and feature-map explainability.
- Finally, our novel comparison between vanilla and adversarially robust models reveals that the common Insertion and Deletion metrics strongly correlate with a model's confidence scores (Sec. 3.6) and only weakly correlate with the other common metric of weakly-supervised localization (Sec. 3.7); see the sketch below for why these metrics depend on confidence scores.
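To make the dependence on confidence scores concrete, here is a minimal sketch of how the Insertion metric (Petsiuk et al., 2018) is typically computed; the function name `insertion_auc`, the blurred baseline, and the step size are illustrative assumptions, not the exact evaluation code of this paper. Deletion is analogous but removes pixels from the original image. Because both metrics integrate raw softmax confidences, models with systematically different confidence distributions (e.g., vanilla vs. adversarially robust) can receive different scores even for attribution maps of similar localization quality.

```python
# Sketch of the Insertion metric (Petsiuk et al., 2018). The blurred
# baseline, step count, and names are illustrative assumptions.
import torch
import torch.nn.functional as F

def insertion_auc(model, image, saliency, target, steps=100):
    """Reveal pixels of `image` (1x3xHxW) in decreasing order of the
    HxW `saliency` map, starting from a blurred baseline, and return
    the area under the curve of softmax confidence for `target`."""
    model.eval()
    _, _, h, w = image.shape
    # Blurred image serves as the "no information" starting point.
    baseline = F.avg_pool2d(image, kernel_size=11, stride=1, padding=5)
    # Rank pixels by attribution score, most important first.
    order = saliency.flatten().argsort(descending=True)
    num_pixels = h * w
    per_step = max(num_pixels // steps, 1)
    confidences = []
    with torch.no_grad():
        for i in range(0, num_pixels + 1, per_step):
            # Reveal the i most important pixels on top of the baseline.
            mask = torch.zeros(num_pixels, device=image.device)
            mask[order[:i]] = 1.0
            mask = mask.view(1, 1, h, w)
            canvas = baseline * (1 - mask) + image * mask
            prob = F.softmax(model(canvas), dim=1)[0, target].item()
            confidences.append(prob)
    # AUC of the confidence curve over a normalized x-axis in [0, 1].
    return torch.trapz(torch.tensor(confidences),
                       dx=1.0 / (len(confidences) - 1)).item()
```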
Assigned Action Editor: ~Alexander_A_Alemi1
Submission Number: 922