Ditch the Gold Standard: Re-evaluating Conversational Question Answering

Anonymous

16 Oct 2021 (modified: 05 May 2023) · ACL ARR 2021 October Blind Submission
Abstract: Conversational question answering (CQA) systems aim to provide natural-language answers to users in information-seeking conversations. Existing benchmarks compare CQA models on pre-collected human-human conversations, with ground-truth answers provided in the conversational history. It remains unclear whether we can rely on this static evaluation for model development, or whether current systems generalize well to real-world human-machine conversations. In this work, we conduct the first large-scale human evaluation of state-of-the-art CQA systems, in which human evaluators converse with models and judge the correctness of their answers. We find that the distribution of human-machine conversations differs drastically from that of human-human conversations, and that evaluations using gold answers are inconsistent with human evaluations. We further investigate how to improve automatic evaluations and propose a question rewriting mechanism based on predicted history, which correlates better with human judgments. Finally, we analyze the impact of various modeling strategies. We hope that our findings shed light on how to develop better CQA systems in the future.
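
To make the "predicted history" idea concrete, below is a minimal sketch of how an evaluation loop might feed a model its own earlier predictions, rather than gold answers, as conversational context before rewriting each question. All names (`model.answer`, `rewrite_question`, the tuple-based history format) are hypothetical illustrations and not the paper's actual interface or rewriting method.

```python
from typing import List, Tuple


def rewrite_question(question: str, history: List[Tuple[str, str]]) -> str:
    # Placeholder rewriter: a real system would use a trained question
    # rewriting model; here we simply prepend the predicted history.
    context = " ".join(q + " " + a for q, a in history)
    return (context + " " + question).strip()


def evaluate_with_predicted_history(model, conversation: List[Tuple[str, str]]) -> List[str]:
    """conversation: list of (question, gold_answer) turns.

    Returns the model's predictions when each question is rewritten
    against the model's OWN previous answers instead of the gold ones.
    """
    predicted_history: List[Tuple[str, str]] = []
    predictions: List[str] = []
    for question, _gold_answer in conversation:
        standalone_q = rewrite_question(question, predicted_history)
        pred = model.answer(standalone_q)  # assumed model interface
        predictions.append(pred)
        predicted_history.append((question, pred))
    return predictions
```

The design point this sketch illustrates is that errors made early in a conversation propagate into later turns, which is what a static gold-history evaluation hides and what the proposed mechanism tries to account for.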
