- Keywords: BERT, Integrated Gradients, Interpretability, Question Answering, RCQA
- Abstract: Ramnath et al. attempt to explain BERT's strong performance on RCQA tasks by characterizing the role of each self-attention layer using Integrated Gradients on the SQuAD v1.1 and DuoRC SelfRC datasets. They then run experiments and analyses to infer how each layer contributes to predicting the answer from the context and question.
  Scope of Reproducibility: Ramnath et al. suggest that the initial layers focus on query-passage interaction, while the later layers focus more on contextual understanding and enhancing answer prediction. We aim to validate this and related claims by fully replicating the authors' experiments, which analyze BERT's layers to understand their RCQA-specific roles and their behavior on potentially confusing Quantifier Questions.
  Methodology: Since the paper's official code is not available, we write our own scripts and modules for processing the data and re-implement the approach as described in the paper. We cross-check our results against those reported in the original paper. We use Google Colab's free GPU for 35-40 hours to fine-tune the model and calculate the Integrated Gradients. The rest of the experiments can be performed on a CPU within 10-15 hours.
  Results: Our reproduced results for all experiments support the central claim made in the paper. All of our statistics and plots agree closely with those in the original paper. We have also analyzed some results beyond the paper and find that the original paper's analysis is transferable and generalizable.
  What was easy: Using HuggingFace Transformers and Datasets for SQuAD v1.1 was easy, as we could adapt the authors' ideas to our code and verify their central claim without much effort. Libraries for Jensen-Shannon Divergence and t-SNE are also readily available and easy to use.
  What was difficult: Re-implementing the paper was more difficult than we expected because of ambiguities and conflicts in how to calculate Integrated Gradients and how to preprocess and postprocess DuoRC. Our implementations differed on these points, and we had to run multiple iterations to decide which variant to use, which took up a lot of computational power unnecessarily.
  Communication with original authors: We interacted frequently with the first author via email for clarification and discussion.
- Paper Url: https://openreview.net/forum?id=bpDFfs40geg
- Supplementary Material: zip
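
The abstract names Integrated Gradients and Jensen-Shannon Divergence as building blocks of the analysis. The sketch below is a minimal illustration of both, not the authors' or the reproducers' implementation. It assumes a toy quadratic function (`toy_model`, with hand-written gradient `toy_grad`) in place of a fine-tuned BERT layer, an all-zero baseline, and a midpoint Riemann sum with 50 steps. It also assumes that attributions are turned into a probability distribution by normalizing their absolute values before two distributions are compared. All function names are hypothetical.

```python
import math


def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Midpoint Riemann-sum approximation of Integrated Gradients.

    grad_fn(point) must return the gradient of the scalar model output
    with respect to the input, evaluated at `point`.
    """
    grad_sum = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(grad_fn(point)):
            grad_sum[i] += g
    # Scale the average gradient along the path by (input - baseline).
    return [(xi - b) * s / steps for xi, b, s in zip(x, baseline, grad_sum)]


def to_distribution(attributions):
    """Normalize absolute attributions so they sum to 1 (an assumed choice)."""
    mags = [abs(a) for a in attributions]
    total = sum(mags)
    if total == 0:
        raise ValueError("all attributions are zero")
    return [m / total for m in mags]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions."""
    def kl(a, b):
        # Terms with a_i == 0 contribute nothing to the KL sum.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical stand-in for a model: f(x) = sum_i w_i * x_i^2.
WEIGHTS = [1.0, 2.0, 0.5]


def toy_model(x):
    return sum(w * xi * xi for w, xi in zip(WEIGHTS, x))


def toy_grad(x):
    return [2.0 * w * xi for w, xi in zip(WEIGHTS, x)]


baseline = [0.0, 0.0, 0.0]
x = [1.0, -1.0, 2.0]
attr = integrated_gradients(toy_grad, x, baseline)

# Completeness check: attributions should sum to f(x) - f(baseline).
gap = abs(sum(attr) - (toy_model(x) - toy_model(baseline)))

# Compare the attribution distributions of two different inputs.
p = to_distribution(attr)
q = to_distribution(integrated_gradients(toy_grad, [2.0, 0.5, 0.1], baseline))
print("attributions:", attr)
print("completeness gap:", gap)
print("JSD between the two attribution distributions:", js_divergence(p, q))
```

With a real model, `grad_fn` would be supplied by an automatic differentiation framework, and the inputs would be token embeddings rather than a three-element list. The completeness check is a quick way to judge whether the number of integration steps is large enough.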