Relevance in Dialogue: Is Less More? An Empirical Comparison of Existing Metrics, and a Novel Simple Metric

Anonymous

16 Nov 2021 (modified: 05 May 2023), ACL ARR 2021 November Blind Submission
Abstract: In this work, we evaluate several existing dialogue relevance metrics and find strong dependencies on the dataset, often with poor correlation with human scores of relevance. We propose modifications that reduce data requirements and domain sensitivity while improving correlation. With these changes, our metric achieves state-of-the-art performance on the HUMOD dataset (Merdivan et al., 2020) while reducing measured sensitivity to the dataset by 50%. We achieve this without fine-tuning, using only 3,750 unannotated human dialogues and a single negative example. Despite these constraints, we demonstrate competitive performance on four datasets from different domains. Our code, including our metric and experiments, is open-sourced.