MEEP: Is this Engaging? Prompting Large Language Models for Dialogue Evaluation in Multilingual Settings

Amila Ferron; Amber Shore; Ekata Mitra; Ameeta Agrawal

Published: 07 Oct 2023, Last Modified: 01 Dec 2023

Venue: EMNLP 2023 Findings

Submission Type: Regular Long Paper

Submission Track: Dialogue and Interactive Systems

Submission Track 2: Natural Language Generation

Keywords: automatic evaluation of dialogue, dialogue evaluation, multilingual, metrics, engagingness, prompting, LLM, large language model, multilinguality

TL;DR: We find that using selected novel prompt constructs with LLMs, including our comprehensive definition of engagingness, outperforms state-of-the-art methods on evaluation of engagingness in dialogue across multiple languages.

Abstract: As dialogue systems become more popular, evaluation of their response quality gains importance. Engagingness highly correlates with overall quality and creates a sense of connection that gives human participants a more fulfilling experience. Although qualities like coherence and fluency are readily measured with well-worn automatic metrics, evaluating engagingness often relies on human assessment, which is a costly and time-consuming process. Existing automatic engagingness metrics evaluate the response without the conversation history, are designed for one dataset, or have limited correlation with human annotations. Furthermore, they have been tested exclusively on English conversations. Given that dialogue systems are increasingly available in languages beyond English, multilingual evaluation capabilities are essential. We propose that large language models (LLMs) may be used for evaluation of engagingness in dialogue through prompting, and ask how prompt constructs and translated prompts compare in a multilingual setting. We provide a prompt-design taxonomy for engagingness and find that using selected prompt elements with LLMs, including our comprehensive definition of engagingness, outperforms state-of-the-art methods on evaluation of engagingness in dialogue across multiple languages.

Submission Number: 1182
