DialogueScore: Evaluating Responses in Task-Oriented Dialogue

Anonymous

16 Jan 2022 (modified: 05 May 2023) · ACL ARR 2022 January Blind Submission · Readers: Everyone
Abstract: Task-oriented dialogue systems have been widely deployed in real-world applications in recent years. Yet, evaluation of task-oriented dialogue systems remains relatively limited. The inform and success metrics consider only the key entities in the generated responses when judging whether the user's goal is achieved. On the other hand, the fluency metric (BLEU score) cannot properly measure the quality of short responses, since the gold responses can be highly diverse. To better explore the behavior and evaluate the generation ability of task-oriented dialogue systems, we examine the relation between user utterances, system responses, and their follow-up utterances. Accordingly, we design a scorer named \textbf{DialogueScore} based on the natural language inference task and synthesize negative data to train it. Through the performance of \textbf{DialogueScore}, we observe that dialogue systems fail to generate high-quality responses compared with the reference responses. Our proposed scorer could therefore provide a new perspective for future dialogue system evaluation and construction.
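As a minimal sketch of the idea described in the abstract, the snippet below scores a system response by how well it is entailed by the surrounding user turns, using an off-the-shelf MNLI model. The checkpoint `roberta-large-mnli`, the `dialogue_score` helper, and the premise/hypothesis pairing are all assumptions for illustration; the paper's actual DialogueScore model is trained separately on synthesized negative data, and its exact input format and scoring formula are not specified here.

```python
# Hedged sketch: NLI-based response scoring for task-oriented dialogue.
# "roberta-large-mnli" is a stand-in assumption, not the paper's checkpoint.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # assumed stand-in for the trained scorer
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def dialogue_score(user_utterance: str, response: str, follow_up: str) -> float:
    """Score a response by the entailment probability between the
    user context (premise) and the system response (hypothesis)."""
    premise = f"{user_utterance} {follow_up}"
    inputs = tokenizer(premise, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze(0)
    # roberta-large-mnli label order: 0=contradiction, 1=neutral, 2=entailment
    return probs[2].item()

# Example: a relevant response should score higher than an off-topic one.
user = "I need a cheap hotel in the north of town."
follow = "Great, can you book it for two nights?"
good = "Sure, the Acorn Guest House is a cheap hotel in the north."
bad = "The museum is open from 9am to 5pm."
print(dialogue_score(user, good, follow), dialogue_score(user, bad, follow))
```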
Paper Type: long