Style over Substance: LLM-as-a-Judge Fails to Evaluate Multi-Party Social Dialogue

Published: 02 Mar 2026, Last Modified: 05 Mar 2026, Venue: ICLR 2026 Workshop ICBINB, License: CC BY 4.0
Keywords: LLM-as-a-Judge, Dialogue Evaluation, Multi-Party Dialogue, Human Evaluation, Conversational AI
TL;DR: LLM-as-a-Judge frameworks fail to align with human preferences for multi-party social dialogue, prioritizing surface style over conversational coherence.
Abstract: The evaluation of multi-party social dialogue remains a significant challenge due to the complexity of turn-taking, distinct personas, and open-ended objectives. A widely adopted solution is to use instruction-tuned Large Language Models (LLMs) as automated judges, under the assumption that sufficiently capable models can approximate human preferences at scale. In this work, we present a negative result demonstrating that state-of-the-art LLM judges (including GPT-5.2 and Gemini 3.0 Flash) fail to align with human judgments in this domain, achieving near-random agreement (Cohen's $\kappa \approx 0.11$ to $0.17$). Through controlled ablations and stress tests, we isolate the mechanism of this failure: judges act as style classifiers rather than discourse evaluators. We show that while judges can detect extreme topic drift, they prefer "assistant-style" utterances over natural dialogue. Our findings expose a critical limitation of LLM-as-a-Judge frameworks for social interaction and caution against optimizing dialogue systems using evaluators that are blind to interactional coherence.
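For readers unfamiliar with the agreement statistic quoted in the abstract, the following is a minimal sketch of how Cohen's $\kappa$ is computed between a human annotator and an LLM judge on pairwise preferences. The function name `cohens_kappa` and the label lists are hypothetical illustrations, not the paper's code or data; the abstract does not specify how the authors computed their scores.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over labels of the product of marginals.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical pairwise preferences ("A" or "B" is the preferred response).
human = ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"]
judge = ["A", "A", "A", "B", "B", "B", "A", "A", "B", "B"]

print(round(cohens_kappa(human, judge), 2))
```

In this toy example the two raters agree on 6 of 10 items, but with balanced labels half of that agreement is expected by chance, so $\kappa$ comes out near 0.2. This is why raw agreement rates can look respectable while $\kappa$ stays close to zero.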
Submission Number: 20