Activation-Deactivation: General Framework for Robust Post-hoc Explainable AI

16 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: explainable AI, black-box explainability, post-hoc explanations, CNN
TL;DR: The paper presents a framework, called activation-deactivation, that replaces masking values in perturbations of inputs with deactivation of the relevant elements of the neural network.
Abstract: Black-box explainability methods are popular tools for explaining the decisions of image classifiers. A major drawback of these tools is their reliance on mutants obtained by occluding parts of the input, leading to out-of-distribution images. This raises doubts about the quality of the explanations. Moreover, choosing an appropriate occlusion value often requires domain knowledge. In this paper we introduce a novel forward-pass paradigm, Activation-Deactivation (AD), which removes the effects of occluded input features from the model’s decision-making by switching off the parts of the model that correspond to the occlusions. We introduce CONVAD, a drop-in mechanism that can be easily added to any trained Convolutional Neural Network (CNN) and that implements the AD paradigm. This leads to more robust explanations without any additional training or fine-tuning. We prove that the CONVAD mechanism does not change the decision-making process of the network. We provide experimental evaluation across several datasets and model architectures. We compare the quality of AD-explanations with explanations achieved using a set of masking values, using the proxies of robustness, size, and confidence drop-off. We observe a consistent improvement in robustness of AD explanations (up to 62.5%) compared to explanations obtained with occlusions, demonstrating that CONVAD extracts more robust explanations without the need for domain knowledge.
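The core idea of the AD forward pass — zeroing ("deactivating") the units whose receptive fields depend on occluded pixels, instead of filling those pixels with a masking value — can be illustrated with a minimal sketch. This is a hypothetical NumPy illustration for a single convolutional layer, not the paper's actual CONVAD implementation; the function names `conv2d_valid` and `conv2d_deactivated` and the mask-propagation rule are our own assumptions.

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive single-channel 2D 'valid' cross-correlation."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def conv2d_deactivated(x, k, occluded):
    """AD-style forward pass (illustrative sketch only):
    rather than overwriting occluded pixels with some baseline value,
    deactivate (zero) every output unit whose receptive field overlaps
    an occluded pixel, and return the propagated occlusion mask so the
    next layer can apply the same rule."""
    out = conv2d_valid(x, k)
    kh, kw = k.shape
    out_mask = np.zeros(out.shape, dtype=bool)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Receptive field of unit (i, j) touches an occluded pixel?
            if occluded[i:i + kh, j:j + kw].any():
                out_mask[i, j] = True
    out = out.copy()
    out[out_mask] = 0.0  # deactivated units contribute nothing downstream
    return out, out_mask
```

Note the property the abstract relies on: when nothing is occluded, the deactivated pass coincides with the ordinary forward pass, so the mechanism does not change the network's decision-making on unperturbed inputs, and no occlusion value ever has to be chosen.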
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 7985