Ask Me Like I'm Human: LLM-based Evaluation with For-Human Instructions Correlates Better with Human Evaluations than Human Judges

Published: 05 Jun 2025, Last Modified: 11 Jun 2025, TRL@ACL 2025, CC BY 4.0
Keywords: LLM Evaluation; LLM as Judge
Abstract: Human evaluation in NLP has high cost and expertise requirements, and instruction-tuned LLMs are increasingly seen as a viable alternative. However, reported correlations with human judgements vary across evaluation contexts and prompt types, and it is currently hard to predict whether an LLM-as-judge metric will work equally well for new evaluation contexts and prompts unless human evaluations are also carried out for comparison. Addressing two main factors contributing to this uncertainty, model suitability and prompt engineering, this focused contribution tests four LLMs and different ways of combining them, in conjunction with a standard approach to prompt formulation, namely using written-for-human instructions verbatim. We meta-evaluate performance against human evaluations on two data-to-text tasks and eight evaluation measures, also comparing against more conventional LLM prompt formulations. We find that the best LLMs and LLM combinations are excellent predictors of mean human judgements, and are particularly good at content-related evaluation (in contrast to form-related criteria such as Fluency). Moreover, the best LLMs correlate far more strongly with human evaluations than individual human judges do, across all scenarios.
Include In Proceedings: Yes
Submission Number: 10
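
The evaluation setup described in the abstract can be pictured as a minimal LLM-as-judge loop: the evaluation instructions originally written for human judges are passed to the model verbatim as the prompt, the model's numeric ratings are collected, and those ratings are correlated with mean human judgements. The sketch below is only an illustration of that idea, not the paper's implementation: the chat API usage, the model name, the `rate_with_llm` helper, the example instruction text, and the choice of Pearson correlation are all assumptions for the sake of the example.

```python
# Minimal sketch of an LLM-as-judge loop using for-human instructions verbatim.
# Assumptions (not taken from the paper): the OpenAI chat API, the model name,
# the instruction wording, and Pearson correlation as the meta-evaluation statistic.
from openai import OpenAI
from scipy.stats import pearsonr

client = OpenAI()

# Instructions originally written for human evaluators, used verbatim as the prompt.
FOR_HUMAN_INSTRUCTIONS = (
    "Please read the input data and the system output below. "
    "Rate the output for Correctness on a scale from 1 (worst) to 5 (best)."
)


def rate_with_llm(data: str, output: str, model: str = "gpt-4o-mini") -> float:
    """Ask the LLM for a single numeric rating, prompting with the
    for-human instructions verbatim plus the item to be judged."""
    prompt = (
        f"{FOR_HUMAN_INSTRUCTIONS}\n\n"
        f"Input data:\n{data}\n\n"
        f"System output:\n{output}\n\n"
        "Rating:"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    # A real pipeline would parse the reply more robustly; a plain float cast
    # is enough for this sketch.
    return float(response.choices[0].message.content.strip())


def meta_evaluate(items, human_means):
    """Correlate LLM ratings with the mean human judgement for each item."""
    llm_scores = [rate_with_llm(data, output) for data, output in items]
    r, p = pearsonr(llm_scores, human_means)
    return r, p
```

A usage example would pass a list of (input data, system output) pairs plus the per-item mean human scores to `meta_evaluate` and report the resulting correlation, mirroring the meta-evaluation against human judgements described in the abstract.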