Keywords: evaluation, long-form conversation, llm-as-a-judge, dataset
TL;DR: An expert-annotated dataset of long-form conversations, with a study of evaluation metrics in multi-turn conversation settings.
Abstract: Evaluating long-form conversations between humans and large language models (LLMs) presents a significant challenge in the field of natural language processing. Traditional evaluation metrics and benchmarks have largely focused on shorter interactions and often fail to capture the nuances inherent in extended dialogues. To address this, we introduce UPHELD, a publicly available dataset featuring human-annotated long-form dialogues. This dataset not only facilitates robust benchmarking but also serves as a foundation for further research into conversation evaluation methodologies. Using our dataset, we systematically analyze the correlation between current LLM evaluation metrics and human judgment in long-form conversation scenarios. Our findings reveal that conventional metrics lack the sensitivity necessary to assess the complex and often subjective nature of prolonged interactions. We then use the dataset to develop an improved evaluation metric that demonstrates a significantly higher correlation with human assessments. This work highlights the need for advanced metric designs and outlines a clear pathway to refining the evaluation of LLM long-form conversations.
Supplementary Material: zip
Primary Area: datasets and benchmarks
Submission Number: 4199