Abstract: Response diversity has become an important criterion for evaluating the quality of open-domain dialogue generation models. However, current evaluation metrics for response diversity do not capture the semantic diversity of generated responses, as they consider only lexical aspects of the responses. In this paper, we introduce a new automatic evaluation metric that measures the semantic diversity of generated responses. Through human evaluation, we demonstrate that our proposed metric correlates more strongly with human judgments of response diversity than existing lexical-level diversity metrics. Furthermore, motivated by an analysis of an existing dialogue dataset, we propose a simple yet effective learning method that improves the semantic diversity of generated responses by re-weighting responses according to the semantic distribution of the training dataset. Through automatic and human evaluation, we show that our proposed learning method improves both response diversity and coherence more than other baseline methods.
Paper Type: long
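To make the notion of semantic (rather than lexical) diversity concrete, the sketch below scores a set of response embeddings by their mean pairwise cosine distance. This is a minimal illustrative example, not the paper's proposed metric: the choice of embedding model, the pairwise-distance aggregation, and the `semantic_diversity` helper are all assumptions introduced here for illustration.

```python
# Hypothetical sketch: a pairwise-distance notion of semantic diversity over
# response embeddings. The embedding source and the aggregation scheme are
# assumptions for illustration, not the metric proposed in the paper.
import numpy as np

def semantic_diversity(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine distance among response embeddings (shape: n x d)."""
    # Normalize each embedding to unit length so dot products are cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T
    n = sim.shape[0]
    # Average cosine distance over distinct response pairs (exclude the diagonal).
    off_diag = sim[~np.eye(n, dtype=bool)]
    return float(np.mean(1.0 - off_diag))

# Toy usage: three random "response" embeddings of dimension 8.
rng = np.random.default_rng(0)
responses = rng.normal(size=(3, 8))
print(semantic_diversity(responses))
```

In practice the embeddings would come from a sentence encoder applied to the generated responses, so that paraphrases with different surface forms score as semantically close, which is exactly what lexical-level diversity metrics miss.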