Explaining a Black Box without another Black Box

Anonymous

16 Dec 2022 (modified: 05 May 2023) · ACL ARR 2022 December Blind Submission
Abstract: Although Counterfactual Explanation Methods (CEMs) are popular approaches to explaining ML classifiers, they are less widespread in NLP. A counterfactual explanation encodes the smallest changes required in the target document to modify the classifier's output. Most CEMs find these explanations by iteratively perturbing the document until the black box classifies it differently. We identify two main families of CEMs in the literature: (a) transparent methods that perturb the target by adding, removing, or replacing words, and (b) opaque approaches that project the target document onto a latent, non-interpretable space where the perturbation is then carried out. This article offers a comparative study of the performance of these two families of methods on three classical NLP tasks. Our empirical evidence shows that opaque CEMs can be overkill for downstream applications such as fake news detection or sentiment analysis, since they add a further layer of opaqueness with no significant performance gain.
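
To make the transparent family concrete, below is a minimal, hypothetical sketch of a word-level counterfactual search: it greedily replaces words in the target document until the black-box classifier flips its prediction. The function and variable names, the toy lexicon classifier, and the replacement candidates are all assumptions for illustration, not the methods evaluated in the paper; an opaque CEM would instead perturb an embedding of the document and decode it back to text.

```python
# Minimal sketch of a "transparent" word-level counterfactual search
# (illustrative only; the toy classifier and names are assumptions,
#  not the paper's method).
from typing import Callable, Dict, List, Optional


def greedy_counterfactual(
    tokens: List[str],
    predict: Callable[[List[str]], int],   # black-box classifier: tokens -> label
    candidates: Dict[str, List[str]],      # per-word replacement candidates
    max_edits: int = 3,
) -> Optional[List[str]]:
    """Greedily replace words until the black box changes its prediction."""
    original_label = predict(tokens)
    current = list(tokens)

    for _ in range(max_edits):
        fallback = None
        for i, word in enumerate(current):
            for repl in candidates.get(word.lower(), []):
                trial = current[:i] + [repl] + current[i + 1:]
                if predict(trial) != original_label:
                    return trial           # counterfactual found: label flipped
                if fallback is None:
                    fallback = (i, repl)   # remember a swap to commit if nothing flips
        if fallback is None:
            return None                    # no candidate perturbations left
        i, repl = fallback
        current[i] = repl                  # commit one edit and keep searching
    return None


if __name__ == "__main__":
    # Toy black box: a lexicon-based sentiment "classifier" standing in for any model.
    positive, negative = {"great", "good"}, {"awful", "bad"}

    def toy_classifier(tokens: List[str]) -> int:
        score = sum(t.lower() in positive for t in tokens) - sum(t.lower() in negative for t in tokens)
        return 1 if score > 0 else 0

    doc = ["the", "movie", "was", "great"]
    swaps = {"great": ["awful", "dull"], "movie": ["film"]}
    print(greedy_counterfactual(doc, toy_classifier, swaps))
    # e.g. ['the', 'movie', 'was', 'awful']  -- the smallest edit that flips the label
```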
Paper Type: short
Research Area: Interpretability and Analysis of Models for NLP
