Reproducibility report on Explainable Automated Fact-Checking for Public Health Claims

Anonymous

05 Feb 2022 (modified: 05 May 2023) · ML Reproducibility Challenge 2021 Fall Blind Submission
Keywords: fact checking, NLP, data augmentation, summarisation, text classification
Abstract:

Scope of Reproducibility: Our work consists of two major parts: reproducing the results of Kotonya & Toni (the original authors) and performing experiments to improve test accuracy and other metrics for veracity prediction. We deliberately did not use the BioBERT model for veracity prediction, as it did not perform well on the defined metrics in the original paper. The authors were doubtful about how well the ROUGE metric conveys the quality of explanations, so they used human evaluation to assess the generated explanations. We relied on ROUGE scores to evaluate the generated explanations.

Methodology: The authors did not publish the code for fine-tuning the BERT and SciBERT models for veracity prediction. For explanation generation, the authors use a BERT-based model that was not made public, so we chose a BART model pre-trained on the CNN/DailyMail dataset instead (illustrative sketches of both setups are given after this abstract). We wrote functional, modular code that is easy to reproduce and comprehend.

Results: The accuracy for veracity prediction using the BERT base model (top 5 sentences) was 3% lower than that published by the authors. The accuracy for veracity prediction using SciBERT (top 5 sentences) was 4.73% lower than that published by the authors. SciBERT performed well on all the test metrics for veracity prediction. While the accuracy was close, the macro F1, precision, and recall were inconsistent with the authors' claims. For explanation generation, the automated evaluation metric was ROUGE. For R1 and RL, the F1 measure we obtained was around 30% higher than the value reported in the paper, and improvements were also observed in the R2 score. We also inspected some of the generated explanations, and they were comparable to the gold-standard explanations.

What was easy: It was easy to implement the code for veracity prediction using the two BERT models. The model used for summarisation was available in the Hugging Face library, pre-trained on the same dataset used by the authors. Without much effort, we were able to fine-tune it on our dataset.

What was difficult: The implementation code was not available in the authors' GitHub repository, so we had to implement it ourselves. It was difficult to raise the accuracy of the models to levels close to those published by the authors.

Communication with original authors: We tried contacting the authors several times but unfortunately could not make contact.
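
The veracity-prediction setup described above can be sketched as follows. This is a minimal, illustrative sketch only: it assumes the allenai/scibert_scivocab_uncased checkpoint from the Hugging Face hub, four veracity labels, and a claim paired with its top-5 evidence sentences; the variable names, placeholder strings, and hyperparameters are assumptions, not the exact configuration used.

```python
# Hedged sketch: fine-tuning SciBERT as a claim-veracity classifier.
# Assumptions: the allenai/scibert_scivocab_uncased checkpoint, 4 veracity
# labels (e.g. true / false / mixture / unproven), and a claim paired with
# its top-5 evidence sentences; all strings below are placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "allenai/scibert_scivocab_uncased"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=4)

claim = "Example public health claim to be fact-checked."
top5_evidence = " ".join([
    "Evidence sentence 1 from the fact-checking article ...",
    "Evidence sentence 2 ...", "Evidence sentence 3 ...",
    "Evidence sentence 4 ...", "Evidence sentence 5 ...",
])
label = torch.tensor([1])  # placeholder gold veracity label index

# Encode claim and evidence as a sentence pair, then take one training step.
inputs = tokenizer(claim, top5_evidence, truncation=True, max_length=512,
                   padding="max_length", return_tensors="pt")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**inputs, labels=label)
outputs.loss.backward()
optimizer.step()
```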
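The explanation-generation step can be sketched in the same spirit. This assumes the facebook/bart-large-cnn checkpoint (BART pre-trained on CNN/DailyMail) and a transformers version that supports the text_target tokenizer argument; the article and gold-explanation strings are placeholders rather than real dataset entries.

```python
# Hedged sketch: fine-tuning BART (pre-trained on CNN/DailyMail) to generate
# explanations via abstractive summarisation. Assumes facebook/bart-large-cnn
# and a transformers version supporting the text_target argument.
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")

article = "Full text of the fact-checking article ..."        # placeholder
gold_explanation = "Gold-standard explanation / verdict ..."  # placeholder

inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")
labels = tokenizer(text_target=gold_explanation, max_length=128,
                   truncation=True, return_tensors="pt").input_ids

# One fine-tuning step: cross-entropy of the decoder output against the gold explanation.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)
model.train()
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()

# At inference time, the explanation is produced as an abstractive summary.
model.eval()
summary_ids = model.generate(inputs["input_ids"], num_beams=4,
                             min_length=40, max_length=150)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```

Generated summaries such as this one can then be scored against the gold explanations with ROUGE (R1, R2, RL), which is the automated evaluation reported in the results above.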