Keywords: LLMs as judges, natural language generation evaluation, measurement theory, responsible AI
TL;DR: In this position paper, we investigate the validity and reliability of LLMs as judges and highlight challenges inherent to their use and to existing practices in NLG evaluation.
Abstract: Evaluating natural language generation (NLG) systems remains a core challenge, further complicated by the rise of general-purpose large language models (LLMs). Recently, large language models as judges (LLJs) have emerged as a scalable, cost-effective alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the inherent limitations of LLMs, LLJs, or current practices in NLG evaluation. To ground our analysis, we explore three applications of LLJs at various stages of the machine learning pipeline: text summarization, data annotation, and safety alignment. Finally, we highlight the need for more responsible evaluation practices in the use of LLJs, to ensure that their growing role in the field supports, rather than undermines, progress in NLG.
Lay Summary: This position paper argues that the current enthusiasm around LLMs as judges (i.e., LLMs used as evaluation metrics) may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. We explore the use of LLMs as judges in three popular applications: text summarization, data annotation, and safety alignment. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions surrounding the use of LLMs as judges: (1) their ability to act as proxies for human judgment, (2) their capabilities as evaluators, (3) their scalability, and (4) their cost-effectiveness.
Submission Number: 720