Automatic Evaluation vs. User Preference in Neural Textual Question Answering over COVID-19 Scientific Literature

04 Sep 2020 (modified: 09 Oct 2020) · EMNLP 2020 Workshop NLP-COVID Submission
  • Keywords: Question Answering, Information Retrieval, COVID-19, CORD-19
  • TL;DR: Evaluation of a Question Answering tool for COVID-19.
  • Abstract: We present a Question Answering (QA) system that won one of the tasks of the Kaggle CORD-19 Challenge, according to the qualitative evaluation of experts. The system combines an Information Retrieval module with a reading comprehension module that finds the answers in the retrieved passages. In this paper we present a quantitative and qualitative analysis of the system. The quantitative evaluation using manually annotated datasets contradicted some of our design choices, e.g., our choice of fine-tuning on QuAC, which we expected to provide better answers than fine-tuning on SQuAD alone. We analyzed this mismatch with an additional A/B test, which showed that the system using QuAC was indeed preferred by users, confirming our intuition. Our analysis calls into question the suitability of automatic metrics and their correlation with user preferences. We also show that automatic metrics are highly dependent on the characteristics of the gold standard, such as the average length of the answers.
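The retrieve-then-read architecture described in the abstract can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the term-overlap retriever and sentence-picking "reader" are stand-ins for the actual IR module and the fine-tuned reading comprehension model, and all function names and example passages are hypothetical.

```python
# Illustrative retrieve-then-read QA pipeline (hypothetical sketch,
# NOT the paper's system: the real retriever and reader are far richer).
from collections import Counter

def tokenize(text):
    return [w.lower().strip(".,?") for w in text.split()]

def retrieve(question, passages, k=1):
    """Toy IR module: rank passages by term overlap with the question."""
    q_terms = Counter(tokenize(question))
    def score(passage):
        return sum(q_terms[t] for t in tokenize(passage))
    return sorted(passages, key=score, reverse=True)[:k]

def read(question, passage):
    """Toy reader: return the passage sentence with most question-term overlap."""
    q_terms = set(tokenize(question))
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    return max(sentences, key=lambda s: len(q_terms & set(tokenize(s))))

passages = [
    "Masks reduce transmission. Hand washing also helps.",
    "The virus genome was sequenced in January.",
]
best = retrieve("How is transmission reduced?", passages)[0]
print(read("How is transmission reduced?", best))  # → Masks reduce transmission
```

In the paper's actual pipeline, the reader stage would be a transformer fine-tuned on SQuAD or QuAC that extracts an answer span from each retrieved passage.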