Information-Theoretic Questionnaire Construction for Consistent Evaluation of Subjective Tasks with LLMs
Keywords: LLMs, Evaluation of Subjective Tasks, Expected Information Gain
TL;DR: Beyond Drift: Stabilizing Subjective LLM Evaluation with Information-Theoretic Rubrics
Abstract: Despite the growing use of large language models (LLMs) in subjective tasks such as role-playing, humor, emotional intelligence, and dialogue quality, their evaluation faces a pressing \textbf{reproducibility crisis}: even the same evaluator may contradict itself when re-judging the exact same sample.
We attribute this instability to \textit{dimension drift}, where free-form evaluation protocols (e.g., Chain-of-Thought reasoning) unpredictably shift their implicit criteria, undermining reliability.
To address this fundamental challenge, we reformulate subjective evaluation as an information-theoretic optimization problem. Specifically, we propose an \textbf{Expected Information Gain (EIG)-based framework} that constructs a stable yet adaptive personalized rubric to eliminate dimension drift.
Our two-stage "generate-then-score" design first produces a diverse pool of candidate evaluation questions and then selects the most informative subset via EIG, yielding explicit and repeatable criteria.
Experiments on six benchmarks, including CharacterEval, rJokes, and MT-Bench, demonstrate that our approach substantially improves both evaluation consistency and alignment with human judgments, outperforming CoT-based and fixed-questionnaire baselines.
These results highlight that information-theoretic questionnaire construction offers a principled and reliable path toward reproducible evaluation of subjective tasks.
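A minimal sketch of the EIG-based selection step is given below. It assumes each candidate question comes with an estimated answer-likelihood table P(answer | latent quality level), so that a question's EIG reduces to the mutual information between its answer and the latent level; all names, shapes, and toy numbers are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (nats) of a probability vector."""
    p = np.clip(p, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def expected_information_gain(prior, likelihood):
    """EIG of one candidate question.

    prior:      (K,)   belief over K latent quality levels
    likelihood: (A, K) P(answer a | quality level k); each column sums to 1
    Returns H(prior) - E_a[H(posterior | a)], i.e. the mutual information
    between the question's answer and the latent quality level.
    """
    p_answer = likelihood @ prior               # (A,) marginal answer probabilities
    eig = entropy(prior)
    for a, pa in enumerate(p_answer):
        if pa <= 0:
            continue
        posterior = likelihood[a] * prior / pa  # Bayes update given answer a
        eig -= pa * entropy(posterior)
    return eig

def select_questions(prior, likelihoods, k):
    """Keep the k candidate questions with the highest individual EIG."""
    scores = [expected_information_gain(prior, L) for L in likelihoods]
    return list(np.argsort(scores)[::-1][:k])

# Toy usage: 3 latent quality levels, 5 candidate yes/no questions.
rng = np.random.default_rng(0)
prior = np.array([0.3, 0.4, 0.3])
candidates = [rng.dirichlet(np.ones(2), size=3).T for _ in range(5)]  # each (2, 3)
print(select_questions(prior, candidates, k=2))
```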
Primary Area: interpretability and explainable AI
Submission Number: 9502