Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation

Bo Pang; Erik Nijkamp; Wenjuan Han; Alex Zhou

25 Sept 2019 (modified: 05 May 2023) | ICLR 2020 Conference Withdrawn Submission | Readers: Everyone

TL;DR: We propose automatic metrics that holistically evaluate open-domain dialogue generation and correlate strongly with human evaluation.

Abstract: Open-domain dialogue generation has gained increasing attention in Natural Language Processing. Comparing dialogue generation methods requires a holistic means of evaluation. Human ratings are deemed the gold standard, but because human evaluation is inefficient and costly, an automated substitute is desirable. In this paper, we propose holistic evaluation metrics that capture both the quality and diversity of dialogues. Our metrics consist of (1) GPT-2 based context coherence between sentences in a dialogue, (2) GPT-2 based fluency in phrasing, and (3) $n$-gram based diversity in responses to augmented queries. The empirical validity of our metrics is demonstrated by strong correlation with human judgments. We provide the associated code, datasets and human ratings.
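
As a rough illustration of item (3), the sketch below scores the diversity of a set of responses by the Shannon entropy of their $n$-gram distribution. The abstract does not state the exact formulation, so the entropy-based scoring, the whitespace tokenization, and the names `ngrams` and `ngram_entropy` are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_entropy(responses, n=2):
    """Shannon entropy (in bits) of the n-gram distribution pooled over responses.

    Assumption: lowercased whitespace tokenization stands in for a real tokenizer.
    Higher values indicate that the responses use a wider variety of n-grams.
    """
    counts = Counter()
    for response in responses:
        counts.update(ngrams(response.lower().split(), n))
    total = sum(counts.values())
    if total == 0:
        # No n-grams at all (empty input or responses shorter than n tokens).
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Under this sketch, a system that returns the same response to every paraphrased query scores lower than one whose responses vary, which is the behavior a diversity metric is meant to reward.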

Keywords: open-dialogue system, generation evaluation, natural language processing
