Keywords: model interpretation, attention mechanism, causal effect estimation
Abstract: Attention-based methods have played an important role in model interpretation, where the calculated attention weights are expected to highlight the critical parts of inputs (e.g., keywords in sentences). However, recent research has pointed out that attention-as-importance interpretations often do not work as well as expected. For example, learned attention weights sometimes highlight less meaningful tokens such as "[SEP]", ",", and ".", and are frequently uncorrelated with other feature importance indicators such as gradient-based measures. As a result, a debate on the effectiveness of attention-based interpretations has arisen. In this paper, we reveal that one root cause of this phenomenon lies in combinatorial shortcuts, which means that models may obtain information not only from the highlighted parts of the input but also from the attention weights themselves; consequently, the attention weights are no longer pure importance indicators. We analyze combinatorial shortcuts theoretically, design an intuitive experiment to demonstrate their existence, and propose two methods to mitigate the issue. Empirical studies on attention-based interpretation models show that the proposed methods can effectively improve the interpretability of attention mechanisms on a variety of datasets.
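To make the combinatorial-shortcut idea concrete, here is a minimal toy sketch (not from the paper; all names and the data-generating setup are illustrative assumptions): even when the attended token values carry no label information, a downstream model can still recover the label from the attention weights themselves, because the weights are a function of the input.

```python
# Toy illustration of a combinatorial shortcut (illustrative, not the paper's setup):
# attention weights can leak label information even when token values carry none.
import numpy as np

rng = np.random.default_rng(0)
n, seq_len = 1000, 8

# Binary labels; attention scores depend on the label, token "content" does not.
y = rng.integers(0, 2, size=n)
scores = rng.normal(size=(n, seq_len)) + 2.0 * y[:, None] * (np.arange(seq_len) == 0)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax

values = np.ones((n, seq_len))   # token content is constant: no signal at all
attended = weights * values      # what a downstream model would consume

# A trivial threshold "classifier" on the first attended position recovers the
# label well above chance -- purely from the attention weights themselves.
pred = (attended[:, 0] > np.median(attended[:, 0])).astype(int)
print("accuracy from weights alone:", (pred == y).mean())
```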
One-sentence Summary: This paper analyzes, from a causal effect estimation perspective, why attention mechanisms sometimes fail to provide interpretable results, and proposes two methods to improve the interpretability of attention mechanisms.
Supplementary Material: zip
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=fuWSY3JGKz