[Short] Evaluating Language Models in Realistic Conversational Contexts

Published: 02 Mar 2026, Last Modified: 02 Mar 2026, ICLR 2026 Workshop DATA-FM, CC BY 4.0
Keywords: dataset, dialogue, llm-as-a-judge
TL;DR: A high-quality, expert-authored dataset for evaluating and training models on human-scale conversations.
Abstract: As Large Language Models (LLMs) are increasingly deployed in open-ended, realistic human-AI interactions, evaluating conversational quality at human scale has become a central challenge. Existing evaluation frameworks, built for summarization, translation, or short-form QA, fall short of measuring the consistency of human-scale dialogue, especially since the derivation and validation of their metrics often rely on synthetic rather than human sources. We fill this gap by introducing UPHELD (Utility & Planning Human-Scale Evaluated Long Dialogues), a large, reference-full benchmark for evaluating human-scale conversational ability beyond factual correctness. UPHELD consists of hundreds of multi-turn human-to-human dialogues authored by professional script writers, with realistic turn densities and 36,000+ per-turn human annotations across 10,000+ expert-generated dialogue turns. We also show that naive quality metrics such as LLM-as-a-judge perform poorly on UPHELD, and that UPHELD can serve as a fine-tuning or validation dataset for developing more robust LLM evaluation metrics in these settings. Overall, UPHELD provides a robust, human-grounded foundation for evaluating long, human-scale conversational intelligence, filling a crucial gap in the existing LLM dataset landscape.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 73