Verifiable Feature Attributions: A Bridge between Post Hoc Explainability and Inherent Interpretability
Keywords: Machine Learning Explainability, Machine Learning Interpretability
Abstract: As machine learning models are increasingly employed in medicine, researchers, healthcare organizations, providers, and patients have all emphasized the need for greater transparency. To provide explanations of models in high-stakes applications, two broad strategies have been outlined in prior literature. Post hoc explanation methods explain the behaviour of complex black-box models by highlighting image regions critical to model predictions; however, prior work has shown that these explanations may not be faithful, and even more concerning is our inability to verify them. Specifically, it is nontrivial to evaluate whether a given feature attribution is correct with respect to the underlying model. Inherently interpretable models, on the other hand, circumvent this issue by explicitly encoding explanations into the model architecture, making their explanations naturally faithful and verifiable, but they often exhibit poor predictive performance due to their limited expressive power. In this work, we aim to bridge the gap between these two strategies by proposing Verifiability Tuning (VerT), a method that transforms black-box models into models with verifiable feature attributions. We begin by introducing a formal theoretical framework to understand verifiability and show that attributions produced by standard models cannot be verified. We then leverage this framework to propose a method for building verifiable models and feature attributions from black-box models. Finally, we perform extensive experiments on semi-synthetic and real-world datasets, and show that VerT produces models that (1) yield explanations that are correct and verifiable and (2) are faithful to the original black-box models they are meant to explain.
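To make the notion of a "verifiable" feature attribution concrete, below is a minimal sketch of one natural check: masking out the features an attribution marks as unimportant and testing whether the model's prediction is (approximately) unchanged. This is an illustrative interpretation only, not the paper's exact verification procedure; the function and parameter names (`verify_attribution`, `attribution_mask`, `baseline`, `tol`) are hypothetical.

```python
import torch

def verify_attribution(model, x, attribution_mask, baseline=None, tol=1e-3):
    """Illustrative check: an attribution is treated as verified if replacing
    the features it deems unimportant with a baseline value leaves the
    model's output (near-)invariant.

    attribution_mask: binary tensor, 1 on features the attribution claims
    the model relies on, 0 elsewhere (same shape as x).
    """
    if baseline is None:
        # Assumed reference value for masked-out features (e.g., zeros).
        baseline = torch.zeros_like(x)

    # Keep attributed features, replace unattributed ones with the baseline.
    x_masked = attribution_mask * x + (1 - attribution_mask) * baseline

    with torch.no_grad():
        original_pred = model(x.unsqueeze(0))
        masked_pred = model(x_masked.unsqueeze(0))

    # Verified if the prediction is approximately unchanged under masking.
    return torch.allclose(original_pred, masked_pred, atol=tol)
```

Under this reading, a standard black-box model may fail such a check even for a "correct" attribution, which is the gap a method like VerT is meant to close by adjusting the model so that its attributions pass verification while staying faithful to the original predictions.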
Submission Number: 66