Keywords: evaluation, long-form conversation, llm-as-a-judge, dataset
TL;DR: An expert-annotated dataset of long-form conversations, with a study of evaluation metrics in multi-turn conversation settings.
Abstract: Evaluating long-form conversations between humans and large language models (LLMs) presents a significant challenge in the field of natural language processing. Traditional evaluation metrics and benchmarks have largely focused on shorter interactions and often fail to capture the nuances inherent in extended dialogues. To address this, we introduce UPHELD, a publicly available dataset featuring human-annotated long-form dialogues. This dataset not only facilitates robust benchmarking but also serves as a foundation for further research into conversation evaluation methodologies. Using our dataset, we systematically analyze the correlation between current LLM evaluation metrics and human judgment in long-form conversation scenarios. Our findings reveal that conventional metrics lack the sensitivity necessary to assess the complex and often subjective nature of prolonged interactions. We then use the dataset to develop an improved evaluation metric that demonstrates a significantly higher correlation with human assessments. This work highlights the need for advanced metric designs and outlines a clear pathway to refining the evaluation of LLM long-form conversations.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 4199