Keywords: LLM-as-judge, multi-turn evaluation, conversational AI evaluation, AI interviewer evaluation, human-AI alignment
TL;DR: MELISSA enables any LLM to become a reliable conversation evaluator through hierarchical decomposition and learned bias correction, revealing that calibration—not model size—determines evaluation quality.
Abstract: As AI systems increasingly conduct complex multi-turn interactions, reliable evaluation becomes critical yet challenging. Current LLM-as-judge approaches suffer from severe biases and struggle with lengthy conversations, while monolithic evaluation misses quality variations across dialogue segments. We present MELISSA, a framework that hierarchically decomposes conversations and learns bias corrections from human judgments, requiring no model fine-tuning. Evaluating 100 AI-conducted technical interviews with expert annotations reveals surprising insights: bias correction alone reduces error by over 50%, indicating that LLMs struggle with scale calibration rather than quality discrimination; a properly aligned GPT-4o-mini outperforms an unaligned Claude 3.5, enabling order-of-magnitude cost reductions; and optional audit mechanisms show mixed results: while they can provide confidence signals, they often introduce unnecessary edits that degrade performance when models already produce well-calibrated evaluations, highlighting the importance of empirical validation over intuitive design. These findings demonstrate that simple calibration transforms weak models into reliable judges, while reinforcing that high-quality human data remains essential for automated evaluation systems.
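To make the bias-correction idea concrete, the sketch below shows one plausible form of a learned correction: an affine (slope and offset) calibration of raw judge scores fit by least squares on a small set of human-annotated conversations. This is only an illustration of the general technique; the abstract does not specify the exact correction form or fitting procedure used by MELISSA, and all scores and scale bounds here are hypothetical.

```python
# Minimal sketch of a learned bias correction for LLM-judge scores.
# Assumes an affine (slope + offset) calibration fit by least squares against
# human-annotated scores; the specific correction used by MELISSA may differ.
# Scores are on a hypothetical 1-10 rubric scale.
import numpy as np


def fit_bias_correction(judge_scores, human_scores):
    """Fit human ~ a * judge + b by ordinary least squares."""
    x = np.asarray(judge_scores, dtype=float)
    y = np.asarray(human_scores, dtype=float)
    design = np.stack([x, np.ones_like(x)], axis=1)          # columns: [judge, 1]
    (a, b), *_ = np.linalg.lstsq(design, y, rcond=None)      # least-squares solution
    return a, b


def apply_bias_correction(judge_scores, a, b, lo=1.0, hi=10.0):
    """Map raw judge scores onto the human scale and clip to the rubric range."""
    corrected = a * np.asarray(judge_scores, dtype=float) + b
    return np.clip(corrected, lo, hi)


if __name__ == "__main__":
    # Toy data: a judge that compresses its scores toward the top of the scale,
    # i.e. it discriminates quality but is poorly calibrated.
    raw_judge = [8.5, 9.0, 8.0, 9.5, 8.8]
    human = [6.0, 8.0, 5.0, 9.0, 7.0]
    a, b = fit_bias_correction(raw_judge, human)
    print(apply_bias_correction(raw_judge, a, b))             # scores rescaled toward human judgments
```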
Submission Number: 215