Keywords: mutual information
Abstract: Feature attribution methods aim to explain model predictions by identifying the input features that most influence the output. While perturbation-based methods are intuitive and widely used, they often rely on restricted or discretized perturbation spaces, limiting their ability to capture the complex dependencies present in high-dimensional data. In this work, we propose a novel framework that defines the perturbation process through a continuous-time stochastic differential equation (SDE), yielding a general, unbounded perturbation space. This formulation significantly increases the expressiveness of the perturbations while introducing new optimization challenges. To address these challenges, we derive a theoretical connection between the KL divergence and the Fisher divergence under general SDEs, and further establish that the time derivative of the mutual information between perturbed and original inputs is governed by the Fisher divergence. These results allow us to simplify the attribution objective and to compute pointwise information as feature importance scores. Empirical results on large-scale image classification tasks show that our method produces sharper, more coherent, and better-localized attribution maps than existing approaches, as demonstrated by both qualitative visualizations and quantitative evaluations.
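For concreteness, the following is a hedged sketch of the kind of identity the abstract alludes to, not necessarily the paper's exact statement. Assume the perturbation follows an Itô SDE $dX_t = f(X_t, t)\,dt + g(t)\,dW_t$ with state-independent diffusion coefficient $g(t)$, and that $p_t$ and $q_t$ both evolve under the corresponding Fokker–Planck equation with sufficient regularity. A standard de Bruijn-type argument then gives
\[
\frac{d}{dt}\, D_{\mathrm{KL}}\!\left(p_t \,\|\, q_t\right)
= -\tfrac{1}{2}\, g(t)^2\, \mathbb{E}_{x \sim p_t}\!\left[\left\| \nabla_x \log p_t(x) - \nabla_x \log q_t(x) \right\|^2\right],
\]
i.e. the KL decay rate is a (scaled) Fisher divergence. Writing the mutual information as $I(X_0; X_t) = \mathbb{E}_{x_0}\!\left[ D_{\mathrm{KL}}\!\left( p_t(\cdot \mid x_0) \,\|\, p_t(\cdot) \right) \right]$ and noting that the conditional and marginal densities evolve under the same dynamics, this yields
\[
\frac{d}{dt}\, I(X_0; X_t)
= -\tfrac{1}{2}\, g(t)^2\, \mathbb{E}_{x_0}\, \mathbb{E}_{x \sim p_t(\cdot \mid x_0)}\!\left[\left\| \nabla_x \log p_t(x \mid x_0) - \nabla_x \log p_t(x) \right\|^2\right],
\]
whose right-hand side is an expected Fisher divergence, consistent with the abstract's claim that the time derivative of mutual information is governed by the Fisher divergence. The per-sample integrand of the outer expectation is one natural candidate for the "pointwise information" used as a feature importance score, though the paper's precise definition may differ.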
Supplementary Material: zip
Primary Area: interpretability and explainable AI
Submission Number: 8573