PairEval: Open-domain Dialogue Evaluation Metric with Pairwise Comparisons

ChaeHun Park; Minseok Choi; Dohyun Lee; Jaegul Choo

PairEval: Open-domain Dialogue Evaluation Metric with Pairwise Comparisons

ChaeHun Park, Minseok Choi, Dohyun Lee, Jaegul Choo

Published: 10 Jul 2024, Last Modified: 26 Aug 2024COLMEveryoneRevisionsBibTeXCC BY 4.0

Research Area: Evaluation

Keywords: Open-domain dialogue, Dialogue evaluation, Evaluation metric

TL;DR: We introduce a new evaluation metric with comparative assessments for open-domain dialogue systems.

Abstract: Building a reliable and automated evaluation metric is a necessary but challenging problem for open-domain dialogue systems. Recent studies proposed evaluation metrics that assess generated responses by considering their relevance to previous dialogue histories. Although effective, these metrics evaluate individual responses directly rather than considering their relative quality compared to other responses. To handle this, we propose PairEval, a novel dialogue evaluation metric for assessing responses by comparing their quality against responses in different conversations. Our metric is built on top of open-sourced and moderate-size language models, and we make them specialized in pairwise comparison between dialogue responses. Extensive experiments on multiple benchmarks demonstrate that our metric exhibits a higher correlation with human judgments than baseline metrics. We also find that the proposed comparative metric is more robust in detecting common failures from open-domain dialogue systems, including repetition and speaker insensitivity. The codes and models will be publicly available after the paper is accepted.

Supplementary Material: zip

Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the COLM Code of Ethics on https://colmweb.org/CoE.html

Author Guide: I certify that this submission complies with the submission instructions as described on https://colmweb.org/AuthorGuide.html

Submission Number: 1378

Loading