BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues

Anonymous

16 Dec 2023
ACL ARR 2023 December Blind Submission
Readers: Everyone
Abstract: In the realm of modern Large Language Models (LLMs), facilitating high-quality, multi-turn dialogues with humans is a cornerstone capability. However, human-based evaluation of this capability involves substantial manual effort. This study offers a formative assessment of current LLMs' proficiency in emulating human-like, multi-turn conversations using an LLM-based methodology. The evaluation pipeline comprises three key elements: utterance generation, evaluation protocol, and judgement, and we examine each in depth. GPT-4 exhibits exceptional performance both as an utterance generator and as a judge. As a generator, it crafts dialogues indistinguishable from human interactions in style and flow; as a judge, it shows strong alignment with human evaluative standards and high consistency. Conversely, other LLMs struggle to produce quality multi-turn dialogues, hindered by inadequate instruction-following ability, a propensity for prolix utterances, and limited overall capability. Notably, generating extensive dialogues (e.g., spanning tens of turns) remains a formidable task for most LLMs, particularly in Chinese. We hope our work serves as a valuable resource for evaluating the multi-turn chatting capabilities of LLMs and provides a robust framework to guide future advancements in this domain.
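
To make the described pipeline concrete, below is a minimal sketch of an LLM-vs-LLM dialogue generation and judging loop. This is illustrative only, not the paper's implementation: it assumes the OpenAI Python SDK (openai>=1.0), uses "gpt-4" as both generator and judge, and the function names (`generate_dialogue`, `judge_dialogue`) and prompt wording are hypothetical.

```python
# Sketch of the two-stage pipeline: (1) two bot instances extend a seed
# dialogue turn by turn; (2) GPT-4 judges whether the result is human-like.
# Assumptions (not from the paper): OpenAI Python SDK, "gpt-4" for both
# roles, and illustrative prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chat(system: str, history: list[dict]) -> str:
    """One generation call; `history` is a list of {"role", "content"} dicts."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "system", "content": system}] + history,
    )
    return resp.choices[0].message.content

def generate_dialogue(seed_a: str, seed_b: str, n_turns: int = 8) -> list[str]:
    """Extend a two-utterance seed into an n-turn dialogue, alternating bots."""
    persona = "Continue the conversation naturally, as a human would. Reply briefly."
    turns = [seed_a, seed_b]
    while len(turns) < n_turns:
        # The bot speaking next sees its own past turns as "assistant"
        # and the other speaker's turns as "user".
        speaker_is_a = (len(turns) % 2 == 0)
        history = [
            {"role": "assistant" if (j % 2 == 0) == speaker_is_a else "user",
             "content": t}
            for j, t in enumerate(turns)
        ]
        turns.append(chat(persona, history))
    return turns

def judge_dialogue(turns: list[str]) -> str:
    """GPT-4 as judge: does the machine-continued dialogue read as human-like?"""
    transcript = "\n".join(f"Speaker {'A' if i % 2 == 0 else 'B'}: {t}"
                           for i, t in enumerate(turns))
    prompt = ("Below is a two-party dialogue. Decide whether it reads like a "
              "natural human conversation. Answer 'human-like' or 'bot-like' "
              "and give a one-sentence reason.\n\n" + transcript)
    return chat("You are a strict dialogue-quality judge.",
                [{"role": "user", "content": prompt}])
```

The same loop extends directly to the long-dialogue setting the abstract highlights (e.g., n_turns in the tens), where degradation in non-GPT-4 generators would be expected to surface.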
Paper Type: long
Research Area: Dialogue and Interactive Systems
Contribution Types: NLP engineering experiment
Languages Studied: English, Chinese