Abstract: Input-gradient-based feature attribution methods, such as Vanilla Gradient, Integrated Gradients, and SmoothGrad, are widely used to explain image classifiers by generating saliency maps. However, these methods struggle to provide explanations that are both visually clear and quantitatively robust. Key challenges include ensuring that explanations are sparse, stable, and faithful to the model's decision-making. Adversarial training, known for enhancing model robustness, has been shown to produce sparser explanations with these methods; however, this sparsity often comes at the cost of stability. In this work, we investigate the trade-off between stability and sparsity in saliency maps and propose the use of a smoothing layer during adversarial training. Through extensive experiments and evaluation, we demonstrate that this smoothing technique improves the stability of saliency maps without sacrificing sparsity. Furthermore, a qualitative user study reveals that human evaluators tend to distrust explanations that are overly noisy or excessively sparse (issues commonly associated with naturally and adversarially trained models, respectively) and prefer explanations produced by our proposed approach. Our findings offer a promising direction for generating reliable explanations with feature-map-smoothed adversarially trained models, striking a balance between clarity and usability.
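To make the idea concrete, below is a minimal sketch of one way such a feature-map smoothing layer could be implemented in PyTorch; the `FeatureMapSmoothing` module, its fixed Gaussian kernel, and its placement after the first convolution are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMapSmoothing(nn.Module):
    """Fixed depthwise Gaussian blur over feature maps (illustrative assumption)."""

    def __init__(self, channels: int, kernel_size: int = 3, sigma: float = 1.0):
        super().__init__()
        # Build a normalized 2D Gaussian kernel (not learned).
        coords = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
        g = torch.exp(-coords**2 / (2 * sigma**2))
        kernel = torch.outer(g, g)
        kernel = kernel / kernel.sum()
        # One copy of the kernel per channel, for a depthwise convolution.
        weight = kernel.expand(channels, 1, kernel_size, kernel_size).clone()
        self.register_buffer("weight", weight)
        self.groups = channels
        self.padding = kernel_size // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # groups == channels: each feature map is blurred independently.
        return F.conv2d(x, self.weight, padding=self.padding, groups=self.groups)

# Example placement inside a small classifier; adversarial training (e.g., PGD)
# would then be run on this smoothed model as usual.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    FeatureMapSmoothing(channels=32, kernel_size=3, sigma=1.0),
    nn.AdaptiveAvgPool2d(4),
    nn.Flatten(),
    nn.Linear(32 * 4 * 4, 10),
)
```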
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: The following changes were made in the revised paper:
- Softened the claim of faithfulness improvement, focusing on the contributions that are empirically supported. Changes were made to the abstract, introduction, conclusion, and Sections 4.2.4 and 4.2.5
- Computed dROAD scores, analogous to the sparsity-stability evaluation, and added the results to Table 1
- Recomputed the faithfulness scores in Sections 4.2.4 and 4.2.5 using the area under the perturbation curve of the ROAD plot, to keep the metrics consistent
- Rewrote Sections 4.2.4 and 4.2.5, organizing them by dataset for clarity
- Added a discussion of the work's limitations, as suggested by the reviewers
This version also incorporates the changes requested in earlier reviews, including the following:
- Moved important details from the Appendix into the main paper
- Included the motivation and results of the qualitative experiments
- Replaced model acronyms with full names in tables, figures, and analysis for clarity
- Grouped the analysis of each saliency-map metric by dataset for clarity
- Added a faithfulness evaluation with a random input gradient as a baseline, along with a comparative analysis
- Added the area under the perturbation curve (dROAD) for quantitative comparison of the ROAD analysis plots (a sketch of this computation follows this list)
- Added the work's assumptions, such as modeling a single-layer network, to the limitations section
- Added related work that discusses similar approaches and supports our findings
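The dROAD computation referenced above admits a simple sketch: treat the ROAD plot as model accuracy versus the fraction of top-attributed pixels removed, and integrate with the trapezoidal rule. The `droad` helper and the placeholder curve values below are hypothetical and are not the paper's implementation.

```python
import numpy as np

def droad(fractions: np.ndarray, accuracies: np.ndarray) -> float:
    """Hypothetical dROAD: area under the ROAD perturbation curve,
    computed with the trapezoidal rule."""
    return float(np.sum((accuracies[1:] + accuracies[:-1]) / 2 * np.diff(fractions)))

# Placeholder data: accuracy after removing the top-attributed pixels at each
# fraction (ROAD replaces removed pixels via noisy linear imputation). A faster
# accuracy drop, i.e., a smaller area, suggests a more faithful saliency map.
fractions = np.array([0.0, 0.1, 0.3, 0.5, 0.7, 0.9])
accuracies = np.array([0.92, 0.80, 0.55, 0.40, 0.28, 0.15])
print(f"dROAD = {droad(fractions, accuracies):.3f}")
```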
Assigned Action Editor: ~Magda_Gregorova2
Submission Number: 4193