[Short] Evaluating Language Models in Realistic Conversational Contexts

Published: 02 Mar 2026, Last Modified: 02 Mar 2026, ICLR 2026 Workshop DATA-FM, CC BY 4.0
Keywords: dataset, dialogue, llm-as-a-judge
TL;DR: A high-quality, expert-authored dataset for evaluating and training models on human-scale conversations.
Abstract: As Large Language Models (LLMs) are increasingly deployed in open-ended, realistic human-AI interactions, evaluating conversational quality at human scale has become a central challenge. Existing evaluation frameworks, built for summarization, translation, or short-form QA, fall short of measuring the consistency of human-scale dialogue, especially since the derivation and validation of their metrics often rely on synthetic rather than human sources. We fill this gap by introducing UPHELD (Utility & Planning Human-Scale Evaluated Long Dialogues), a large, reference-full benchmark for evaluating human-scale conversational ability beyond factual correctness. UPHELD consists of hundreds of multi-turn human-to-human dialogues authored by professional script writers, with realistic turn densities and 36,000+ per-turn human annotations across 10,000+ expert-generated dialogue turns. We also show that naive quality metrics such as LLM-as-a-judge perform poorly on UPHELD, and that UPHELD can serve as a fine-tuning or validation dataset for developing more robust LLM evaluation metrics in these settings. Overall, UPHELD provides a robust, human-grounded foundation for evaluating long, human-scale conversational intelligence, filling a crucial gap in the existing LLM dataset landscape.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 73