- Keywords: Self-attention, interpretability, identifiability, BERT, Transformer, NLP, explanation, gradient attribution
- TL;DR: We investigate the identifiability and interpretability of attention distributions and tokens within contextual embeddings in the self-attention based BERT model.
- Abstract: In this work we contribute towards a deeper understanding of the Transformer architecture by investigating two of its core components: self-attention and contextual embeddings. In particular, we study the identifiability of attention weights and token embeddings, and the aggregation of context into hidden tokens. We show that attention weights are not unique and propose effective attention as an alternative for better interpretability. Furthermore, we show that input tokens retain their identity in the first hidden layers and then progressively become less identifiable. We also provide evidence for the role of non-linear activations in preserving token identity. Finally, we demonstrate strong mixing of input information in the generation of contextual embeddings by means of a novel quantification method based on gradient attribution. Overall, we show that self-attention distributions are not directly interpretable and present tools to further investigate Transformer models.