Keywords: dialogue evaluation, information gain, multi-turn dialogue, uncertainty reduction, information-seeking dialogue, embedding-based metrics, LLM evaluation, efficient evaluation
Abstract: Evaluating multi-turn dialogue systems remains challenging, as dialogue quality depends on how
effectively an agent accumulates relevant information across turns.
In this work, we propose a fast, information-theoretic metric for evaluating multi-turn dialogue
based on uncertainty reduction in embedding space over the course of a conversation.
Our approach admits a tractable Gaussian approximation and enjoys desirable theoretical properties,
including monotonicity, telescoping over turns, and submodularity.
Unlike recent approaches that rely on large language models as judges, our method is fully
reference-free (no ground-truth answers, no gold references, no human annotations at evaluation time),
deterministic, and computationally efficient. We show that the proposed metric remains effective even when instantiated with extremely lightweight
embedding models under CPU-only execution, indicating that the evaluative signal does not require
large model capacity or autoregressive inference.
We evaluate the proposed metric on MT-Bench and Chatbot Arena, where its agreement with human
preferences is competitive with several LLM-as-a-judge baselines and, on MT-Bench, surpasses them.
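
The abstract leaves the construction implicit; below is a minimal sketch, not the paper's implementation, assuming a Gaussian (GP-style) information gain over turn embeddings with a linear kernel and isotropic observation noise. All names (`per_turn_gains`, `noise`) are illustrative. Under these assumptions, the 0.5 log det(I + noise⁻² K) form yields per-turn gains that are non-negative (monotonicity), sum to the total gain (telescoping), and exhibit diminishing returns (submodularity), matching the properties claimed above.

```python
import numpy as np


def logdet_info(embs: np.ndarray, noise: float = 0.1) -> float:
    """0.5 * log det(I + noise^-2 * K), with K the linear-kernel Gram
    matrix of the turn embeddings seen so far (Gaussian information gain)."""
    if embs.shape[0] == 0:
        return 0.0
    k = embs @ embs.T  # (n_turns, n_turns) Gram matrix
    _, logdet = np.linalg.slogdet(np.eye(embs.shape[0]) + k / noise**2)
    return 0.5 * logdet


def per_turn_gains(turn_embs: np.ndarray) -> list[float]:
    """Information gain of each turn: the marginal increase in
    0.5 log det(I + noise^-2 K) when that turn's embedding is added.
    Gains are non-negative, telescope to the total, and shrink for
    turns that repeat earlier content (submodularity)."""
    gains, prev = [], 0.0
    for t in range(1, turn_embs.shape[0] + 1):
        cur = logdet_info(turn_embs[:t])
        gains.append(cur - prev)
        prev = cur
    return gains


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embs = rng.normal(size=(4, 16))
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)  # unit-norm rows
    # Append a duplicate of turn 0: its gain should be far smaller
    # than the first occurrence's, illustrating diminishing returns.
    embs = np.vstack([embs, embs[0]])
    print([round(g, 3) for g in per_turn_gains(embs)])
```

In this toy run, the duplicated final turn receives a much smaller gain than its first occurrence, and the whole computation is a few small matrix operations, consistent with the claim that a lightweight, CPU-only embedding model suffices to supply the turn embeddings.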
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: evaluation and metrics
Contribution Types: Approaches low compute settings-efficiency, Theory
Languages Studied: English
Submission Number: 987