Double Trouble: How to not explain a text classifier's decisions using counterfactuals synthesized by masked language models?
Abstract: A principle behind dozens of attribution methods is to take the prediction difference between before and after an input feature (here, a token) is removed as that feature's attribution, i.e., the individual treatment effect in causal inference. A recent, popular Input Marginalization (IM) method (Kim et al., EMNLP 2020) uses BERT to replace a token, i.e., simulating the $do(\cdot)$ operator, yielding more plausible counterfactuals. While Kim et al. reported that IM is effective, we find this conclusion unconvincing, as the DeletionBERT metric used in their paper is biased towards IM. Importantly, this bias should exist in many Deletion-based metrics, e.g., Insertion (Arras et al., 2017), Sufficiency, and Comprehensiveness (DeYoung et al., ACL 2020). Furthermore, our rigorous evaluation using 6 metrics and 3 datasets finds no evidence that IM is better than a Leave-One-Out (LOO) baseline. We provide two explanations for why IM is not better than LOO: (1) deleting a single word from the input only marginally reduces a classifier's accuracy; and (2) a highly predictable word is always given near-zero attribution, which may not match its true importance to the target classifier.
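A minimal sketch of the two attribution methods contrasted in the abstract, LOO and IM, assuming Hugging Face transformers, a BERT masked language model, and an SST-2 sentiment classifier. The model names, whitespace tokenization, and the top-k truncation of the MLM's vocabulary are illustrative simplifications, not the authors' exact implementation:

```python
import torch
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          AutoModelForSequenceClassification)

# Assumed model names; any BERT-style MLM and text classifier will do.
MLM_NAME = "bert-base-uncased"
CLF_NAME = "textattack/bert-base-uncased-SST-2"

tok = AutoTokenizer.from_pretrained(CLF_NAME)
clf = AutoModelForSequenceClassification.from_pretrained(CLF_NAME).eval()
mlm_tok = AutoTokenizer.from_pretrained(MLM_NAME)
mlm = AutoModelForMaskedLM.from_pretrained(MLM_NAME).eval()


@torch.no_grad()
def class_prob(text, label):
    """Probability the classifier assigns to `label` for `text`."""
    logits = clf(**tok(text, return_tensors="pt")).logits
    return torch.softmax(logits, dim=-1)[0, label].item()


@torch.no_grad()
def loo_attribution(words, label):
    """Leave-One-Out: delete each word and measure the drop in p(label)."""
    base = class_prob(" ".join(words), label)
    return [base - class_prob(" ".join(words[:i] + words[i + 1:]), label)
            for i in range(len(words))]


@torch.no_grad()
def im_attribution(words, label, top_k=10):
    """Input Marginalization (simplified): replace each word with the MLM's
    top-k candidates and take the difference between the original prediction
    and the MLM-probability-weighted average prediction over candidates."""
    base = class_prob(" ".join(words), label)
    attributions = []
    for i in range(len(words)):
        # Mask position i and query the MLM for replacement candidates.
        masked = words[:i] + [mlm_tok.mask_token] + words[i + 1:]
        enc = mlm_tok(" ".join(masked), return_tensors="pt")
        mask_pos = (enc.input_ids[0] == mlm_tok.mask_token_id).nonzero()[0].item()
        probs = torch.softmax(mlm(**enc).logits[0, mask_pos], dim=-1)
        top_p, top_ids = probs.topk(top_k)
        top_p = top_p / top_p.sum()  # renormalize over the truncated vocabulary
        marginal = 0.0
        for p, tid in zip(top_p.tolist(), top_ids.tolist()):
            # Subword candidates (e.g., "##ing") are kept as-is for simplicity.
            cand = words[:i] + [mlm_tok.decode([tid]).strip()] + words[i + 1:]
            marginal += p * class_prob(" ".join(cand), label)
        attributions.append(base - marginal)
    return attributions


words = "a gripping and heartfelt movie".split()
print(loo_attribution(words, label=1))  # label 1 = positive for this (assumed) classifier
print(im_attribution(words, label=1))
```

Note how the sketch exposes the second failure mode above: a word the MLM predicts with near-certainty yields counterfactuals nearly identical to the original input, so its IM attribution is pushed towards zero regardless of how much the classifier relies on it.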
Paper Type: long