Abstract: Automatic model assessment has long been a critical challenge in machine learning. Traditional methods, typically based on matching or small models, often fall short in open-ended and dynamic scenarios. Recent advances in Large Language Models (LLMs) have inspired the ``LLM-as-a-judge'' paradigm, in which LLMs are leveraged to perform scoring, ranking, or selection across a variety of machine learning evaluation scenarios. This paper presents a comprehensive survey of LLM-based judgment and assessment, offering an in-depth overview of this evolving field. We first define LLM-as-a-judge from both the input and output perspectives. We then introduce a systematic taxonomy that explores LLM-as-a-judge along three dimensions: \textit{what} to judge, \textit{how} to judge, and \textit{how} to benchmark. Finally, we highlight key challenges and promising future directions for this emerging area. We have released and will maintain a paper list on \textbf{LLM-as-a-judge} at: \url{https://anonymous.4open.science/r/Awesome-LLM-as-a-judge-266D}.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: Large Language Models, LLM-as-a-judge, Model Evaluation
Contribution Types: Model analysis & interpretability, Surveys
Languages Studied: English
Submission Number: 6253