Learning to Deceive With Attention-Based Explanations

Andrew Harrison; Rahel Habacker; Ard Snijders; Mathias Parisot

Published: 01 Apr 2021, Last Modified: 05 May 2023, RC2020, Readers: Everyone

Keywords: Attention-based explanation, Attention, BERT, BiLSTM, Embeddings

Abstract: Reproducibility Summary

Scope of Reproducibility

Based on the intuition that attention in neural networks is what the model focuses on, attention is now being used as an explanation for a model's prediction (see Galassi et al. (2020) for a survey). Pruthi et al. (2020) challenge the use of attention-based explanations through a series of experiments using classification and sequence-to-sequence (seq2seq) models. They examine the model's use of impermissible tokens, which are user-defined tokens that can introduce bias, e.g. gendered pronouns. Across multiple datasets, the authors show that model accuracy drops when the impermissible tokens are removed, implying that the model uses them in prediction. Then, by penalising attention paid to the impermissible tokens while keeping them in the input, they train models that retain full accuracy, and hence must be using the impermissible tokens, yet do not show attention being paid to them. As the paper's claims have such significant implications for the use of attention-based explanations, we seek to reproduce their results.

Methodology

Using the authors' code, we attempt to reproduce their embedding, BiLSTM, and BERT classifier results across the occupation prediction, gender identification, and SST + Wiki datasets. Further, we reimplemented BERT using HuggingFace's Transformers library (Wolf et al., 2019) with restricted self-attention (information cannot flow between permissible and impermissible tokens). For seq2seq, we used the authors' code to reproduce results across the bigram flip, sequence copy, sequence reverse, and English-German (En-De) machine translation datasets. We refactored the authors' code toward a more uniformly usable code style and ported it to PyTorch Lightning. All experiments were run in approximately 130 GPU hours on a computing cluster with nodes containing Titan RTX GPUs.

Results

We reproduced the authors' results across all models and all available datasets, confirming their findings that attention-based explanations can be manipulated and that models can learn to deceive. We also replicated their BERT results using our reimplemented model. Only one result was less strongly in their experimental direction (by more than 1 S.D.).

What Was Easy

The authors' methods were largely well described and easy to follow, and we could quickly produce the first results, as their code worked straightaway with minor adjustments. The authors were also extremely responsive and helpful via email.

What Was Difficult

Re-implementing the BERT-based classification model for the replicability study was difficult, as it required further specification details on the model architecture, penalty mechanism, and training procedure. Porting the code to PyTorch Lightning was also difficult.

Communication With Original Authors

There was a continuous email chain with the authors over the course of several weeks during the reproducibility work. They made additional code and datasets available at our request and provided detailed responses and clarifications to our emailed questions. They encouraged the work, and we wish to thank them for their time and support.
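
The two mechanisms named in the abstract, penalising attention paid to impermissible tokens and restricting self-attention so that information cannot flow between permissible and impermissible tokens, can be illustrated with a minimal sketch. This is not the authors' code or our reimplementation. The penalty form used here, -lambda * log(1 - attention mass on impermissible tokens), is an assumption chosen for illustration, and the function names, the `lam` coefficient, and the `eps` clamp are all hypothetical. The restricted-attention mask is also simplified: it ignores special tokens such as [CLS].

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def impermissible_attention_mass(attention, impermissible):
    """Sum of attention weights that fall on impermissible tokens."""
    return sum(a for a, flagged in zip(attention, impermissible) if flagged)


def attention_penalty(attention, impermissible, lam=1.0, eps=1e-12):
    """Assumed penalty: -lam * log(1 - mass on impermissible tokens).

    The penalty is zero when no attention falls on impermissible tokens
    and grows as more attention is paid to them. `eps` (hypothetical)
    guards against log(0) when all attention is on impermissible tokens.
    """
    mass = impermissible_attention_mass(attention, impermissible)
    return -lam * math.log(max(1.0 - mass, eps))


def restricted_attention_mask(impermissible):
    """Boolean matrix: position i may attend to j only within the same group.

    Simplified sketch: permissible tokens attend only to permissible tokens,
    and impermissible tokens only to impermissible tokens.
    """
    n = len(impermissible)
    return [[impermissible[i] == impermissible[j] for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    # Toy sentence of three tokens; the middle one is impermissible.
    impermissible = [False, True, False]
    attention = softmax([0.2, 1.5, 0.1])
    task_loss = 0.4  # placeholder value standing in for the task loss
    total_loss = task_loss + attention_penalty(attention, impermissible, lam=0.1)
    print("attention:", attention)
    print("total loss:", total_loss)
    print("allowed attention pairs:", restricted_attention_mask(impermissible))
```

In this sketch the penalty is simply added to the task loss, so a model can lower its total loss by moving attention away from the flagged tokens without necessarily changing how it uses them.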

Paper Url: https://openreview.net/forum?id=ZVxchkVPa8S

Supplementary Material: zip
