Information-Theoretic Questionnaire Construction for Consistent Evaluation of Subjective Tasks with LLMs
Keywords: LLMs, Evaluation of Subjective Tasks, Expected Information Gain
TL;DR: Beyond Drift: Stabilizing Subjective LLM Evaluation with Information-Theoretic Rubrics
Abstract: Despite the growing use of large language models (LLMs) in subjective tasks such as role-playing, humor, emotional intelligence, and dialogue quality, their evaluation faces a pressing \textbf{reproducibility crisis}: even the same evaluator may contradict itself when re-judging the exact same sample.
We attribute this instability to \textit{dimension drift}, where free-form evaluation protocols (e.g., Chain-of-Thought reasoning) unpredictably shift their implicit criteria, undermining reliability.
To address this fundamental challenge, we reformulate subjective evaluation as an information-theoretic optimization problem. Specifically, we propose an \textbf{Expected Information Gain (EIG)-based framework} that constructs a stable yet adaptive personalized rubric to eliminate dimension drift.
Our two-stage "generate-then-score" design first produces a diverse pool of candidate evaluation questions and then selects the most informative subset via EIG, yielding explicit and repeatable criteria.
Experiments on six benchmarks, including CharacterEval, rJokes, and MT-Bench, demonstrate that our approach substantially improves both evaluation consistency and alignment with human judgments, outperforming CoT-based and fixed-questionnaire baselines.
These results highlight that information-theoretic questionnaire construction offers a principled and reliable path toward reproducible evaluation of subjective tasks.
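A minimal sketch of the EIG-based selection step is given below. It assumes each candidate question comes with an estimated answer-likelihood table P(answer | latent quality level), so that a question's EIG reduces to the mutual information between its answer and the latent level; all names, shapes, and toy numbers are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy (nats) of a probability vector."""
    p = np.clip(p, eps, 1.0)
    return float(-np.sum(p * np.log(p)))

def expected_information_gain(prior, likelihood):
    """EIG of one candidate question.

    prior:      (K,)   belief over K latent quality levels
    likelihood: (A, K) P(answer a | quality level k); each column sums to 1
    Returns H(prior) - E_a[H(posterior | a)], i.e. the mutual information
    between the question's answer and the latent quality level.
    """
    p_answer = likelihood @ prior               # (A,) marginal answer probabilities
    eig = entropy(prior)
    for a, pa in enumerate(p_answer):
        if pa <= 0:
            continue
        posterior = likelihood[a] * prior / pa  # Bayes update given answer a
        eig -= pa * entropy(posterior)
    return eig

def select_questions(prior, likelihoods, k):
    """Keep the k candidate questions with the highest individual EIG."""
    scores = [expected_information_gain(prior, L) for L in likelihoods]
    return list(np.argsort(scores)[::-1][:k])

# Toy usage: 3 latent quality levels, 5 candidate yes/no questions.
rng = np.random.default_rng(0)
prior = np.array([0.3, 0.4, 0.3])
candidates = [rng.dirichlet(np.ones(2), size=3).T for _ in range(5)]  # each (2, 3)
print(select_questions(prior, candidates, k=2))
```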
Primary Area: interpretability and explainable AI
Submission Number: 9502