Abstract: Input-gradient-based attribution methods, such as Vanilla Gradient, Integrated Gradients, and SmoothGrad, are widely used to explain image classifiers via saliency maps. However, these methods often produce explanations that are noisy or unstable. While prior work primarily focuses on refining the explanation techniques themselves, we explore a complementary model-centered perspective grounded in explainability-by-design. Specifically, we examine how adversarial training affects saliency map quality and propose a lightweight feature-map smoothing mechanism that can be integrated during training. Evaluating across FMNIST, CIFAR-10, and ImageNette, we find that local smoothing (e.g., mean and median filters) improves the stability and perceived clarity of explanations while preserving the sparsity gains from adversarial training. However, gains in faithfulness are method- and dataset-dependent, highlighting that interpretability improvements may not generalize uniformly. A user study with 65 participants further confirms that explanations from smoothed adversarial models are perceived as more comprehensible and trustworthy. Our work highlights the value of model-level interventions for improving post-hoc explanations. Our code is available at \url{https://anonymous.4open.science/r/ImprovingVG-2BFA/README.md}.
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=POaVh2Gu14
Changes Since Last Submission: In response to constructive feedback from the last submission, we have made the following major revisions to clarify our motivation, tighten the paper’s claims, and better support them with evidence:
- The title of the paper has been revised to make it more informative.
- The abstract and introduction have been rewritten to clearly frame the paper under the lens of explainability-by-design, improving explanation quality by modifying model training and architecture, rather than changing the explanation methods themselves.
- We have removed non-local smoothing variants from the analysis. These methods produced inconsistent results and distracted from the core focus of the paper. The revised experiments now focus on local filters (mean, median, Gaussian), improving the coherence of the results.
- We added a new experiment to isolate the effect of receptive field expansion without smoothing. Specifically, we study whether saliency map quality benefits from the smoothing filter itself or from the receptive field expansion introduced by the convolution operation.
- We revised the results section to align the narrative with the evidence, especially for the faithfulness metrics. Overstatements have been removed, and dataset- and method-specific variability is now acknowledged directly in both the discussion and the conclusion.
Assigned Action Editor: ~Magda_Gregorova2
Submission Number: 5125